Data Analysis - Airbnb Melbourne Australia¶

Airbnb, a global online marketplace headquartered in San Francisco, California, was founded in 2008 and operates in over 191 countries. It enables individuals to monetize their spaces by renting them out to travelers.

Through Airbnb’s platform, hosts can offer guests short-term lodging and tourism-related activities. Guests have the flexibility to search for accommodations based on type, dates, location, and price, and can even seek out specific types of homes, such as bed and breakfasts, unique homes, and vacation homes.

Known for its diverse accommodations, Airbnb boasts a variety of properties ranging from single rooms and apartments to moored yachts, houseboats, entire islands, and even castles.

By offering unique and personalized experiences, Airbnb has transformed the travel industry. A prime example of this success is Melbourne, Australia.

This city, celebrated for its cultural diversity, blend of modern and historical architecture, lively arts scene, and passion for good food and coffee, perfectly embodies Airbnb’s ethos of local experiences and personal connections. The wide variety of unique accommodations available in Melbourne further enhances its appeal to travelers.

In light of this, I intend to conduct an analysis of Melbourne Airbnb Data. This dataset contains information about hosts, neighbourhoods, room types, prices, reviews, ratings, etc. Through the Inside Airbnb website, it is possible to collect this dataset and you can even find Airbnb data for other cities around the world.

This analysis aims to explore the data, perform necessary data cleaning steps to prepare it for analysis, and then delve into the data to reveal trends and patterns. The goal is to provide valuable insights into how Airbnb is utilized in Melbourne and understand its implications for hosts, guests, and the wider community.

With that in mind, I have defined the problem statement as the following: Analyze the key characteristics of Airbnb listings in Melbourne to gain a comprehensive understanding of the local rental market.

In alignement with the problem statement, the following research questions have been formulated:

  1. Who are the top 10 hosts with the most listings?

  2. Where are the top 10 Melbourne Airbnbs listings located?

  3. How are the room type distributed in the listings?

  4. What are the 5 most common numbers of bedrooms in the listings?

  5. What are the 5 most frequent minimum nights?

  6. What are the four amenities most commonly provided?

  7. Which room type in Melbourne's Airbnb listings has the highest number of reviews?

  8. What is the average of amenities by room type?

  9. Is the average price per night higher for private room than for shared rooms?

  10. Properties with more reviews might be priced higher due to their popularity?

  11. Listings with more amenities might be priced higher?

1. Setting up the environment¶

I'm about to embark on a journey of data exploration and analysis. To aid me in this quest, I've gathered a set of powerful tools, each with its own unique capabilities. With these tools at my disposal, I'm ready to dive into the world of data and uncover the answers it holds.

In [1]:
# Import the libraries
import pandas as pd # Used for data manipulation and analysis
import numpy as np # Handles numerical operations like mathematical computations
import matplotlib.pyplot as plt # Creates static, animated, and intereactive visualizations
import seaborn as sns # Makes statistical graphics
import plotly.express as px # Generate interactive plots
import plotly.graph_objects as go # Generate interactive plots
import statistics as st # Provides functions to perform statistical operations
import geopandas as gpd # Deals with geospatial data
import ast # Allows Python code to be parsed into an abstract syntax tree (AST)
from collections import Counter # Keeps track of the frequency of elements in a collection
import warnings # Used to warn the user about situations that aren't necessarily exceptions

This code is setting up the environment for data analysis and visualization. These settings help in making the data analysis process more efficient and the output more readable.

In [2]:
# Set up the pandas
pd.set_option('display.max_columns', None) # Display all columns in DataFrame.
pd.set_option('display.max_rows', None) # Display all rows in DataFrame.

# Set up the matplotlib
plt.style.use('ggplot') # Apply 'ggplot' style to plots.

# Set up warnings
warnings.filterwarnings('ignore') # Ignore all warning messages.

2. Data Collection¶

I'm going to import the data from a CSV file into a DataFrame, preparing it for our upcoming analysis.

In [3]:
# Import DataFrame
data = pd.read_csv("/kaggle/input/melbourne-airbnb-september-2023-dataset/listings.csv")

3. Data Exploration¶

This stage involves three main steps:

  • Understanding the DataFrame: This involves getting to know the structure of the DataFrame, its columns (which represent the variables or features in the data), and the type of data each column holds.

  • Descriptive Statistics: This is where we start to dive deeper into the data. We calculate summary statistics like mean, median, mode, standard deviation etc., for each feature in the DataFrame. This gives us a sense of the distribution of values and can help identify trends and patterns.

  • Data Quality Assessment: In this step, we check the quality of our data. We look for missing values, outliers, or inconsistent data entries, and decide to handle them. This ensures that our analysis is based on clean, reliable data.

These steps provide a solid foundation for any data analysis task. They help us understand what our data looks like and ensure it's in good shape for further analysis.

Understanding the DataFrame¶

Using the .shape attribute we can see a tuple representing the dimensionality of the DataFrame. The first element of the tuple will give you the number of rows (23185) and the second element will give you the number of columns (75). This is a quick way to know how large your dataset is.

In [4]:
# Check the DataFrame dimension
data.shape
Out[4]:
(23185, 75)

The .dtypes attribute returns a Series with the data type of each column. The result will be in the format column_name: data_type. This is useful to know because the data type of the columns can affect what operations you can perform on them.

In [5]:
# Check data types
data.dtypes
Out[5]:
id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
source                                           object
name                                             object
description                                      object
neighborhood_overview                            object
picture_url                                      object
host_id                                           int64
host_url                                         object
host_name                                        object
host_since                                       object
host_location                                    object
host_about                                       object
host_response_time                               object
host_response_rate                               object
host_acceptance_rate                             object
host_is_superhost                                object
host_thumbnail_url                               object
host_picture_url                                 object
host_neighbourhood                               object
host_listings_count                             float64
host_total_listings_count                       float64
host_verifications                               object
host_has_profile_pic                             object
host_identity_verified                           object
neighbourhood                                    object
neighbourhood_cleansed                           object
neighbourhood_group_cleansed                    float64
latitude                                        float64
longitude                                       float64
property_type                                    object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bathrooms_text                                   object
bedrooms                                        float64
beds                                            float64
amenities                                        object
price                                            object
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                            int64
maximum_minimum_nights                            int64
minimum_maximum_nights                            int64
maximum_maximum_nights                            int64
minimum_nights_avg_ntm                          float64
maximum_nights_avg_ntm                          float64
calendar_updated                                float64
has_availability                                 object
availability_30                                   int64
availability_60                                   int64
availability_90                                   int64
availability_365                                  int64
calendar_last_scraped                            object
number_of_reviews                                 int64
number_of_reviews_ltm                             int64
number_of_reviews_l30d                            int64
first_review                                     object
last_review                                      object
review_scores_rating                            float64
review_scores_accuracy                          float64
review_scores_cleanliness                       float64
review_scores_checkin                           float64
review_scores_communication                     float64
review_scores_location                          float64
review_scores_value                             float64
license                                         float64
instant_bookable                                 object
calculated_host_listings_count                    int64
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
dtype: object

This line of code is using the .head() method to display, in this case, the first 3 rows of the DataFrame 'data'. This is a quick way to preview the data and get a sense of what it looks like.

In [6]:
# Preview the first 3 rows
data.head(3)
Out[6]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed latitude longitude property_type room_type accommodates bathrooms bathrooms_text bedrooms beds amenities price minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm number_of_reviews_l30d first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 10803 https://www.airbnb.com/rooms/10803 20230904181306 2023-09-05 city scrape Rental unit in Brunswick East · ★4.49 · 1 bedr... A large air conditioned room with firm queen s... This hip area is a crossroads between two grea... https://a0.muscache.com/pictures/e5f30dd1-ac57... 38901 https://www.airbnb.com/users/show/38901 Lindsay 2009-09-16 Melbourne, Australia As an artist working in animation and video I ... within a few hours 100% 95% f https://a0.muscache.com/im/pictures/user/6860c... https://a0.muscache.com/im/pictures/user/6860c... Brunswick 1.0 1.0 ['email', 'phone'] t t Brunswick East, Victoria, Australia Moreland NaN -37.76606 144.97951 Private room in rental unit Private room 2 NaN 1 shared bath NaN 1.0 ["Microwave", "Hot water kettle", "Drying rack... $49.00 5 14 1 5 14 14 5.0 14.0 NaN t 4 13 35 201 2023-09-05 173 21 0 2013-01-12 2023-07-19 4.49 4.65 3.98 4.72 4.69 4.66 4.61 NaN f 1 0 1 0 1.33
1 12936 https://www.airbnb.com/rooms/12936 20230904181306 2023-09-05 previous scrape Rental unit in St Kilda · ★4.68 · 1 bedroom · ... RIGHT IN THE HEART OF ST KILDA! It doesn't get... A stay at our apartment means you can enjoy so... https://a0.muscache.com/pictures/59701/2e8cdaf... 50121 https://www.airbnb.com/users/show/50121 The A2C Team 2009-10-31 Melbourne, Australia NaN NaN NaN NaN f https://a0.muscache.com/im/pictures/user/3894a... https://a0.muscache.com/im/pictures/user/3894a... St Kilda 10.0 20.0 ['email', 'phone', 'work_email'] t t St Kilda, Victoria, Australia Port Phillip NaN -37.85999 144.97662 Entire rental unit Entire home/apt 2 NaN 1 bath 1.0 1.0 ["Microwave", "Lockbox", "Hot water", "Smoke a... $95.00 3 14 3 3 14 14 3.0 14.0 NaN t 0 0 0 0 2023-09-05 42 0 0 2010-08-04 2020-03-15 4.68 4.78 4.71 4.83 4.83 4.78 4.66 NaN f 10 10 0 0 0.26
2 38271 https://www.airbnb.com/rooms/38271 20230904181306 2023-09-04 city scrape Rental unit in Berwick · ★4.86 · 3 bedrooms · ... No children under 13 will be accepted in your ... Our street is quiet & secluded but within walk... https://a0.muscache.com/pictures/1182791/3bf4b... 164193 https://www.airbnb.com/users/show/164193 Daryl & Dee 2010-07-12 Berwick, Australia We are an active couple who work from home and... within an hour 100% 91% t https://a0.muscache.com/im/pictures/user/29c38... https://a0.muscache.com/im/pictures/user/29c38... NaN 1.0 1.0 ['email', 'phone', 'work_email'] t t Berwick, Victoria, Australia Casey NaN -38.05723 145.33982 Entire rental unit Entire home/apt 5 NaN 1 bath 3.0 3.0 ["Microwave", "Hot water kettle", "Drying rack... $116.00 1 14 1 1 14 14 1.0 14.0 NaN t 18 40 64 325 2023-09-04 228 36 3 2010-11-24 2023-08-26 4.86 4.92 4.98 4.91 4.94 4.90 4.88 NaN f 1 1 0 0 1.47

Here I use .tail() method to display the last 3 rows of the DataFrame 'data'. This is a quick way to see the most recent entries in your data.

In [7]:
# Preview the last 3 rows
data.tail(3)
Out[7]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed latitude longitude property_type room_type accommodates bathrooms bathrooms_text bedrooms beds amenities price minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm number_of_reviews_l30d first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
23182 971613881418926837 https://www.airbnb.com/rooms/971613881418926837 20230904181306 2023-09-05 city scrape Home in Abbotsford · ★New · 4 bedrooms · 4 bed... Cả nhóm sẽ dễ dàng tiếp cận mọi địa điểm từ ch... NaN https://a0.muscache.com/pictures/hosting/Hosti... 535127590 https://www.airbnb.com/users/show/535127590 Trần 2023-09-02 Abbotsford, Australia NaN NaN NaN NaN f https://a0.muscache.com/im/pictures/user/User-... https://a0.muscache.com/im/pictures/user/User-... NaN 1.0 1.0 ['email', 'phone'] t t NaN Yarra NaN -37.807580 145.000010 Entire home Entire home/apt 3 NaN 1 bath 4.0 4.0 ["Wifi", "Kitchen", "Free parking on premises"... $214.00 1 365 1 1 365 365 1.0 365.0 NaN f 29 59 89 269 2023-09-05 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN f 1 1 0 0 NaN
23183 971616602219608699 https://www.airbnb.com/rooms/971616602219608699 20230904181306 2023-09-04 city scrape Rental unit in Box Hill · ★New · 2 bedrooms · ... Enjoy easy access to everything from this perf... NaN https://a0.muscache.com/pictures/hosting/Hosti... 163554327 https://www.airbnb.com/users/show/163554327 Steven 2017-12-21 Box Hill South, Australia NaN within an hour 100% 99% f https://a0.muscache.com/im/pictures/user/70c58... https://a0.muscache.com/im/pictures/user/70c58... NaN 12.0 21.0 ['email', 'phone'] t t NaN Whitehorse NaN -37.818577 145.116941 Entire rental unit Entire home/apt 4 NaN 2 baths 2.0 3.0 ["Pool", "Wifi", "Kitchen", "Hot tub", "BBQ gr... $118.00 1 365 1 1 365 365 1.0 365.0 NaN t 30 60 90 179 2023-09-04 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN t 11 11 0 0 NaN
23184 971632586652679242 https://www.airbnb.com/rooms/971632586652679242 20230904181306 2023-09-04 city scrape Home in Springvale · ★New · Studio · 1 bed · 0... The room is spacious with built in robe and a ... NaN https://a0.muscache.com/pictures/hosting/Hosti... 395737294 https://www.airbnb.com/users/show/395737294 Ratana 2021-04-06 Springvale, Australia I am a professional engineer, running my own e... within an hour 100% 63% f https://a0.muscache.com/im/pictures/user/c3e11... https://a0.muscache.com/im/pictures/user/c3e11... NaN 3.0 3.0 ['email', 'phone'] t t NaN Greater Dandenong NaN -37.951797 145.157050 Private room in home Private room 2 NaN 0 shared baths NaN 1.0 ["Security cameras on property", "Smoke alarm"... $80.00 1 365 1 1 365 365 1.0 365.0 NaN t 4 34 64 339 2023-09-04 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN f 3 0 3 0 NaN

Descriptive Statistics¶

Using .describe() method generates a summary statistics of the DataFrame. This method returns count, mean, standard deviation, minimum and maximum values, and the quartile of the data for each column. It's a quick way to understand the distribution of each column in your data.

In [8]:
# Check summary statistics
data.describe()
Out[8]:
id scrape_id host_id host_listings_count host_total_listings_count neighbourhood_group_cleansed latitude longitude accommodates bathrooms bedrooms beds minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm number_of_reviews_l30d review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value license calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
count 2.318500e+04 2.318500e+04 2.318500e+04 23183.000000 23183.000000 0.0 23185.000000 23185.000000 23185.000000 0.0 16888.000000 23002.000000 23185.000000 23185.000000 23185.000000 23185.000000 2.318500e+04 2.318500e+04 23185.000000 2.318500e+04 0.0 23185.000000 23185.000000 23185.000000 23185.000000 23185.000000 23185.000000 23185.000000 18675.000000 18429.000000 18433.000000 18427.000000 18432.000000 18427.000000 18427.000000 0.0 23185.000000 23185.000000 23185.000000 23185.000000 18675.000000
mean 3.472407e+17 2.023090e+13 1.607668e+08 28.485140 49.072510 NaN -37.828206 145.013870 3.659521 NaN 1.974420 2.127858 5.305499 560.395212 4.851369 6.089411 1.019553e+06 1.019580e+06 5.466875 1.019567e+06 NaN 11.431011 25.594609 40.532327 137.641320 30.803666 9.655769 0.716066 4.643992 4.748298 4.663471 4.779708 4.811204 4.824680 4.659304 NaN 17.263274 12.863403 4.246927 0.101100 1.370684
std 4.043073e+17 4.847761e+00 1.585790e+08 101.397683 199.137068 NaN 0.074543 0.150682 2.357757 NaN 1.076814 1.759124 29.116198 806.061758 25.774833 28.565941 4.676585e+07 4.676585e+07 26.640337 4.676585e+07 NaN 11.490477 23.304951 35.290262 131.909717 60.806538 17.740837 1.465085 0.696825 0.440869 0.493951 0.415675 0.410474 0.332126 0.470684 NaN 49.074356 33.107832 29.576002 0.967221 1.633770
min 1.080300e+04 2.023090e+13 7.834000e+03 1.000000 1.000000 NaN -38.225690 144.518047 1.000000 NaN 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000e+00 1.000000e+00 1.000000 1.000000e+00 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 NaN 1.000000 0.000000 0.000000 0.000000 0.010000
25% 2.359906e+07 2.023090e+13 3.039392e+07 1.000000 1.000000 NaN -37.854080 144.956622 2.000000 NaN 1.000000 1.000000 1.000000 90.000000 1.000000 2.000000 1.800000e+02 3.650000e+02 1.000000 3.650000e+02 NaN 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 4.600000 4.700000 4.550000 4.740000 4.800000 4.790000 4.570000 NaN 1.000000 1.000000 0.000000 0.000000 0.240000
50% 4.792597e+07 2.023090e+13 9.879526e+07 2.000000 3.000000 NaN -37.819440 144.977240 3.000000 NaN 2.000000 2.000000 2.000000 365.000000 2.000000 2.000000 1.125000e+03 1.125000e+03 2.000000 1.125000e+03 NaN 8.000000 25.000000 42.000000 91.000000 7.000000 2.000000 0.000000 4.830000 4.890000 4.820000 4.920000 4.950000 4.920000 4.780000 NaN 2.000000 1.000000 0.000000 0.000000 0.880000
75% 7.894182e+17 2.023090e+13 2.628817e+08 10.000000 15.000000 NaN -37.802090 145.022480 4.000000 NaN 2.000000 3.000000 3.000000 1125.000000 3.000000 5.000000 1.125000e+03 1.125000e+03 3.700000 1.125000e+03 NaN 23.000000 49.000000 75.000000 267.000000 32.000000 13.000000 1.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 4.940000 NaN 8.000000 7.000000 1.000000 0.000000 1.990000
max 9.729284e+17 2.023090e+13 5.351276e+08 849.000000 1755.000000 NaN -37.484800 145.844997 16.000000 NaN 14.000000 117.000000 1125.000000 99999.000000 1000.000000 1125.000000 2.147484e+09 2.147484e+09 1000.000000 2.147484e+09 NaN 30.000000 60.000000 90.000000 365.000000 987.000000 922.000000 40.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 NaN 327.000000 250.000000 251.000000 20.000000 44.060000

Data Quality Assessment¶

With .info() method we get a concise summary of the DataFrame. That method prints information about the DataFrame including the index, columns, non-null values, dtype and memory usage. It's a quick way to understand the structure of your data, including the number of entries, the data type of each column, and the amount of memory used.

In [9]:
# Check DataFrame information
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23185 entries, 0 to 23184
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            23185 non-null  int64  
 1   listing_url                                   23185 non-null  object 
 2   scrape_id                                     23185 non-null  int64  
 3   last_scraped                                  23185 non-null  object 
 4   source                                        23185 non-null  object 
 5   name                                          23185 non-null  object 
 6   description                                   22822 non-null  object 
 7   neighborhood_overview                         13256 non-null  object 
 8   picture_url                                   23185 non-null  object 
 9   host_id                                       23185 non-null  int64  
 10  host_url                                      23185 non-null  object 
 11  host_name                                     23183 non-null  object 
 12  host_since                                    23183 non-null  object 
 13  host_location                                 17286 non-null  object 
 14  host_about                                    12833 non-null  object 
 15  host_response_time                            15418 non-null  object 
 16  host_response_rate                            15418 non-null  object 
 17  host_acceptance_rate                          16579 non-null  object 
 18  host_is_superhost                             22468 non-null  object 
 19  host_thumbnail_url                            23183 non-null  object 
 20  host_picture_url                              23183 non-null  object 
 21  host_neighbourhood                            8772 non-null   object 
 22  host_listings_count                           23183 non-null  float64
 23  host_total_listings_count                     23183 non-null  float64
 24  host_verifications                            23185 non-null  object 
 25  host_has_profile_pic                          23183 non-null  object 
 26  host_identity_verified                        23183 non-null  object 
 27  neighbourhood                                 13257 non-null  object 
 28  neighbourhood_cleansed                        23185 non-null  object 
 29  neighbourhood_group_cleansed                  0 non-null      float64
 30  latitude                                      23185 non-null  float64
 31  longitude                                     23185 non-null  float64
 32  property_type                                 23185 non-null  object 
 33  room_type                                     23185 non-null  object 
 34  accommodates                                  23185 non-null  int64  
 35  bathrooms                                     0 non-null      float64
 36  bathrooms_text                                23174 non-null  object 
 37  bedrooms                                      16888 non-null  float64
 38  beds                                          23002 non-null  float64
 39  amenities                                     23185 non-null  object 
 40  price                                         23185 non-null  object 
 41  minimum_nights                                23185 non-null  int64  
 42  maximum_nights                                23185 non-null  int64  
 43  minimum_minimum_nights                        23185 non-null  int64  
 44  maximum_minimum_nights                        23185 non-null  int64  
 45  minimum_maximum_nights                        23185 non-null  int64  
 46  maximum_maximum_nights                        23185 non-null  int64  
 47  minimum_nights_avg_ntm                        23185 non-null  float64
 48  maximum_nights_avg_ntm                        23185 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              23185 non-null  object 
 51  availability_30                               23185 non-null  int64  
 52  availability_60                               23185 non-null  int64  
 53  availability_90                               23185 non-null  int64  
 54  availability_365                              23185 non-null  int64  
 55  calendar_last_scraped                         23185 non-null  object 
 56  number_of_reviews                             23185 non-null  int64  
 57  number_of_reviews_ltm                         23185 non-null  int64  
 58  number_of_reviews_l30d                        23185 non-null  int64  
 59  first_review                                  18675 non-null  object 
 60  last_review                                   18675 non-null  object 
 61  review_scores_rating                          18675 non-null  float64
 62  review_scores_accuracy                        18429 non-null  float64
 63  review_scores_cleanliness                     18433 non-null  float64
 64  review_scores_checkin                         18427 non-null  float64
 65  review_scores_communication                   18432 non-null  float64
 66  review_scores_location                        18427 non-null  float64
 67  review_scores_value                           18427 non-null  float64
 68  license                                       0 non-null      float64
 69  instant_bookable                              23185 non-null  object 
 70  calculated_host_listings_count                23185 non-null  int64  
 71  calculated_host_listings_count_entire_homes   23185 non-null  int64  
 72  calculated_host_listings_count_private_rooms  23185 non-null  int64  
 73  calculated_host_listings_count_shared_rooms   23185 non-null  int64  
 74  reviews_per_month                             18675 non-null  float64
dtypes: float64(20), int64(21), object(34)
memory usage: 13.3+ MB

The .columns attribute returns the names of the columns in the DataFrame. It's an efficient method to identify all the features or variables you have in your data.

In [10]:
# Check DataFrame column names
data.columns
Out[10]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object')

The .nunique() method identifies the number of unique values in each column of the DataFrame. This can be useful to understand the diversity of values in each variable in the DataFrame.

In [11]:
# Identify unique values by each column
data.nunique()
Out[11]:
id                                              23185
listing_url                                     23185
scrape_id                                           1
last_scraped                                        2
source                                              2
name                                            16587
description                                     21358
neighborhood_overview                           10985
picture_url                                     22651
host_id                                         13395
host_url                                        13395
host_name                                        5353
host_since                                       3942
host_location                                     957
host_about                                       6468
host_response_time                                  4
host_response_rate                                 62
host_acceptance_rate                               94
host_is_superhost                                   2
host_thumbnail_url                              13074
host_picture_url                                13074
host_neighbourhood                                117
host_listings_count                                97
host_total_listings_count                         135
host_verifications                                  8
host_has_profile_pic                                2
host_identity_verified                              2
neighbourhood                                     515
neighbourhood_cleansed                             30
neighbourhood_group_cleansed                        0
latitude                                        16261
longitude                                       16767
property_type                                      83
room_type                                           4
accommodates                                       16
bathrooms                                           0
bathrooms_text                                     38
bedrooms                                           12
beds                                               23
amenities                                       21892
price                                             924
minimum_nights                                     83
maximum_nights                                    192
minimum_minimum_nights                             79
maximum_minimum_nights                             88
minimum_maximum_nights                            176
maximum_maximum_nights                            174
minimum_nights_avg_ntm                            300
maximum_nights_avg_ntm                            601
calendar_updated                                    0
has_availability                                    2
availability_30                                    31
availability_60                                    61
availability_90                                    91
availability_365                                  366
calendar_last_scraped                               2
number_of_reviews                                 442
number_of_reviews_ltm                             137
number_of_reviews_l30d                             20
first_review                                     3168
last_review                                      2172
review_scores_rating                              163
review_scores_accuracy                            157
review_scores_cleanliness                         187
review_scores_checkin                             168
review_scores_communication                       152
review_scores_location                            116
review_scores_value                               162
license                                             0
instant_bookable                                    2
calculated_host_listings_count                     63
calculated_host_listings_count_entire_homes        62
calculated_host_listings_count_private_rooms       19
calculated_host_listings_count_shared_rooms         9
reviews_per_month                                 822
dtype: int64

Using .isnull().sum() method gives you a summary of how many missing values there are in each column in the DataFrame. It's a crucial first step in understanding the quality and completeness of the data.

In [12]:
# Check for missing values
data.isnull().sum()
Out[12]:
id                                                  0
listing_url                                         0
scrape_id                                           0
last_scraped                                        0
source                                              0
name                                                0
description                                       363
neighborhood_overview                            9929
picture_url                                         0
host_id                                             0
host_url                                            0
host_name                                           2
host_since                                          2
host_location                                    5899
host_about                                      10352
host_response_time                               7767
host_response_rate                               7767
host_acceptance_rate                             6606
host_is_superhost                                 717
host_thumbnail_url                                  2
host_picture_url                                    2
host_neighbourhood                              14413
host_listings_count                                 2
host_total_listings_count                           2
host_verifications                                  0
host_has_profile_pic                                2
host_identity_verified                              2
neighbourhood                                    9928
neighbourhood_cleansed                              0
neighbourhood_group_cleansed                    23185
latitude                                            0
longitude                                           0
property_type                                       0
room_type                                           0
accommodates                                        0
bathrooms                                       23185
bathrooms_text                                     11
bedrooms                                         6297
beds                                              183
amenities                                           0
price                                               0
minimum_nights                                      0
maximum_nights                                      0
minimum_minimum_nights                              0
maximum_minimum_nights                              0
minimum_maximum_nights                              0
maximum_maximum_nights                              0
minimum_nights_avg_ntm                              0
maximum_nights_avg_ntm                              0
calendar_updated                                23185
has_availability                                    0
availability_30                                     0
availability_60                                     0
availability_90                                     0
availability_365                                    0
calendar_last_scraped                               0
number_of_reviews                                   0
number_of_reviews_ltm                               0
number_of_reviews_l30d                              0
first_review                                     4510
last_review                                      4510
review_scores_rating                             4510
review_scores_accuracy                           4756
review_scores_cleanliness                        4752
review_scores_checkin                            4758
review_scores_communication                      4753
review_scores_location                           4758
review_scores_value                              4758
license                                         23185
instant_bookable                                    0
calculated_host_listings_count                      0
calculated_host_listings_count_entire_homes         0
calculated_host_listings_count_private_rooms        0
calculated_host_listings_count_shared_rooms         0
reviews_per_month                                4510
dtype: int64

Here we have a summary of the percentage of missing values.

In [13]:
# Check percentage of missing values
round(data.isnull().sum() / len(data) * 100,2)
Out[13]:
id                                                0.00
listing_url                                       0.00
scrape_id                                         0.00
last_scraped                                      0.00
source                                            0.00
name                                              0.00
description                                       1.57
neighborhood_overview                            42.83
picture_url                                       0.00
host_id                                           0.00
host_url                                          0.00
host_name                                         0.01
host_since                                        0.01
host_location                                    25.44
host_about                                       44.65
host_response_time                               33.50
host_response_rate                               33.50
host_acceptance_rate                             28.49
host_is_superhost                                 3.09
host_thumbnail_url                                0.01
host_picture_url                                  0.01
host_neighbourhood                               62.17
host_listings_count                               0.01
host_total_listings_count                         0.01
host_verifications                                0.00
host_has_profile_pic                              0.01
host_identity_verified                            0.01
neighbourhood                                    42.82
neighbourhood_cleansed                            0.00
neighbourhood_group_cleansed                    100.00
latitude                                          0.00
longitude                                         0.00
property_type                                     0.00
room_type                                         0.00
accommodates                                      0.00
bathrooms                                       100.00
bathrooms_text                                    0.05
bedrooms                                         27.16
beds                                              0.79
amenities                                         0.00
price                                             0.00
minimum_nights                                    0.00
maximum_nights                                    0.00
minimum_minimum_nights                            0.00
maximum_minimum_nights                            0.00
minimum_maximum_nights                            0.00
maximum_maximum_nights                            0.00
minimum_nights_avg_ntm                            0.00
maximum_nights_avg_ntm                            0.00
calendar_updated                                100.00
has_availability                                  0.00
availability_30                                   0.00
availability_60                                   0.00
availability_90                                   0.00
availability_365                                  0.00
calendar_last_scraped                             0.00
number_of_reviews                                 0.00
number_of_reviews_ltm                             0.00
number_of_reviews_l30d                            0.00
first_review                                     19.45
last_review                                      19.45
review_scores_rating                             19.45
review_scores_accuracy                           20.51
review_scores_cleanliness                        20.50
review_scores_checkin                            20.52
review_scores_communication                      20.50
review_scores_location                           20.52
review_scores_value                              20.52
license                                         100.00
instant_bookable                                  0.00
calculated_host_listings_count                    0.00
calculated_host_listings_count_entire_homes       0.00
calculated_host_listings_count_private_rooms      0.00
calculated_host_listings_count_shared_rooms       0.00
reviews_per_month                                19.45
dtype: float64

Below I used .duplicated().sum() to identify duplicate rows in the DataFrame. Here, no duplicates were identified. This is a useful step to spot and handle redundant data that we might want to remove to ensure the quality and accuracy of our analysis. The next stage is Data Cleaning.

In [14]:
# Identify duplicates
data.duplicated().sum()
Out[14]:
0

4. Data Cleaning¶

In the Data Cleaning process, each of the following steps is crucial in ensuring that the DataFrame is clean, consistent, and ready for analysis. They help improve the accuracy and reliability of the analysis.

  • Remove Unnecessary Columns: This step involves eliminating columns that are not required for the analysis. These could be columns that contain redundant or irrelevant information.

  • Handling Missing Values: This step deals with missing or null values in the DataFrame. Strategies to handle them can include filling them with a specific value (like mean, median, or mode) or removing the rows/columns with missing values entirely.

  • Data Type Conversion: Sometimes, the data types of the columns might not be appropriate for the analysis. For example, a numerical column might be stored as strings. This step involves converting such columns to their correct data type.

  • Handling Inconsistent Data: This process corrects any discrepancies and improper formatting to ensure data consistency.

  • Removing Duplicates: This step involves identifying and removing duplicate rows in the DataFrame. Duplicate data can skew the analysis and lead to incorrect conclusions.

  • Outlier Detection and Treatment: Outliers are data points that significantly deviate from other observations. They can be genuine variations or occur due to errors. This step involves identifying these outliers and determining the appropriate treatment, which could involve removal or transformation.

Remove Unnecessary Columns¶

I'm going to review the columns names once more, determine which ones are relevant, and select them to find the answers.

In [15]:
# Check column names
data.columns
Out[15]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object')

Before proceeding with the Data Cleaning process, I will create a backcup of the original DataFrame to preserve our initial data. Then, I will select a subset of columns that are pertinent to our analysis. This approach is crucial in the Data Cleaning process as it allows us to focus on the most significant variables, thereby making the subsequent analysis more efficient and manageable.

In [16]:
# Drop irrelevant columns (making a DataFrame backup!)
data_v2 = data[[
       #      'id', 
       'listing_url',
       #      'scrape_id', 'last_scraped', 'source', 
       'name',
       #      'description', 'neighborhood_overview', 'picture_url', 
       'host_id',
       #      'host_url', 'host_name', 
       #      'host_since', 'host_location', 'host_about',
       #      'host_response_time',
       #      'host_response_rate', 'host_acceptance_rate',
       #      'host_is_superhost', 
       #      'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 
       #      'host_listings_count', 'host_total_listings_count', 
       #      'host_verifications',
       #      'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 
       #      'neighbourhood_group_cleansed', 
       'latitude', 'longitude', 
       #      'property_type', 
       'room_type', 'accommodates', 
       #      'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights',
       #      'maximum_nights', 'minimum_minimum_nights',
       #      'maximum_minimum_nights', 'minimum_maximum_nights',
       #      'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       #      'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       #      'availability_30', 'availability_60', 'availability_90',
       #      'availability_365', 'calendar_last_scraped', 
       'number_of_reviews',
       #      'number_of_reviews_ltm', 'number_of_reviews_l30d', 
       #      'first_review', 'last_review', 'review_scores_rating', 
       #      'review_scores_accuracy',
       #      'review_scores_cleanliness', 'review_scores_checkin',
       #      'review_scores_communication', 'review_scores_location',
       #      'review_scores_value', 
       #      'license', 'instant_bookable', 'calculated_host_listings_count',
       #      'calculated_host_listings_count_entire_homes',
       #      'calculated_host_listings_count_private_rooms',
       #      'calculated_host_listings_count_shared_rooms', 
       #      'reviews_per_month'
       ]]

By using the .shape attribute, we can observe that our DataFrame now consists of 23185 rows and 15 relevant columns mentioned below:

  • 'listing_url': The URL of the listing
  • 'name': The name of the listing
  • 'host_id': The ID of the host
  • 'neighbourhood_cleansed': The cleaned name of the neighbourhood
  • 'latitude': The latitude coordinate of the listing
  • 'longitude': The longitude coordinate of the listing
  • 'room_type': The type of room
  • 'accommodates': The number of people the listing accommodates
  • 'bathrooms_text': The number of bathrooms in text format
  • 'bedrooms': The number of bedrooms
  • 'beds': The number of beds
  • 'amenities': The amenities provided by the listing
  • 'price': The price of the listing
  • 'minimum_nights': The minimum number of nights for the listing
  • 'number_of_reviews': The number of reviews the listing has received
In [17]:
# Check new DataFrame dimension
data_v2.shape
Out[17]:
(23185, 15)

I will use the .head() once again to get a quick overview of the new Dataframe.

In [18]:
# Preview the first 3 rows of the new DataFrame
data_v2.head(3)
Out[18]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
0 https://www.airbnb.com/rooms/10803 Rental unit in Brunswick East · ★4.49 · 1 bedr... 38901 Moreland -37.76606 144.97951 Private room 2 1 shared bath NaN 1.0 ["Microwave", "Hot water kettle", "Drying rack... $49.00 5 173
1 https://www.airbnb.com/rooms/12936 Rental unit in St Kilda · ★4.68 · 1 bedroom · ... 50121 Port Phillip -37.85999 144.97662 Entire home/apt 2 1 bath 1.0 1.0 ["Microwave", "Lockbox", "Hot water", "Smoke a... $95.00 3 42
2 https://www.airbnb.com/rooms/38271 Rental unit in Berwick · ★4.86 · 3 bedrooms · ... 164193 Casey -38.05723 145.33982 Entire home/apt 5 1 bath 3.0 3.0 ["Microwave", "Hot water kettle", "Drying rack... $116.00 1 228

Handling Missing Values¶

By using .isnull().sum() method, we can see that there are 3 columns ('bathrooms_text', 'bedrooms', 'beds') with missing values. Let's proceed to handle these missing values.

In [19]:
# Check for missing values
data_v2.isnull().sum()
Out[19]:
listing_url                  0
name                         0
host_id                      0
neighbourhood_cleansed       0
latitude                     0
longitude                    0
room_type                    0
accommodates                 0
bathrooms_text              11
bedrooms                  6297
beds                       183
amenities                    0
price                        0
minimum_nights               0
number_of_reviews            0
dtype: int64

This line of code is a common way to identify and inspect rows with missing values in a specific column of a DataFrame. Here, I'm examining the 'bathrooms_text' column which has been found to contain 11 missing values.

In [20]:
# Check "bathrooms_text" column missing values
data_v2[data_v2['bathrooms_text'].isnull()]
Out[20]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
120 https://www.airbnb.com/rooms/268620 Home in Caroline Springs · 1 bedroom 1407948 Melton -37.73975 144.73742 Private room 1 NaN NaN NaN [] $72.00 1 0
211 https://www.airbnb.com/rooms/508590 Home in Balwyn North · ★4.63 · 1 bedroom 2255480 Boroondara -37.79370 145.07935 Private room 1 NaN NaN NaN [] $50.00 1 16
6523 https://www.airbnb.com/rooms/26888053 Nature lodge in Mt Dandenong · ★4.78 · 1 bed 156749664 Yarra Ranges -37.82976 145.35228 Hotel room 2 NaN NaN 1.0 ["Clothing storage: closet", "Microwave", "Hot... $325.00 1 41
6537 https://www.airbnb.com/rooms/26904346 Nature lodge in Mt Dandenong · ★5.0 · 1 bedroo... 156749664 Yarra Ranges -37.83003 145.35347 Hotel room 2 NaN 1.0 1.0 ["Microwave", "Hot water kettle", "Carbon mono... $364.00 1 3
7643 https://www.airbnb.com/rooms/31025816 Aparthotel in Docklands · ★4.21 · 2 bedrooms ·... 16970146 Melbourne -37.81403 144.94222 Hotel room 4 NaN 2.0 2.0 ["Microwave", "Coffee maker", "High chair", "H... $282.00 1 19
8388 https://www.airbnb.com/rooms/34046314 Serviced apartment in Southbank · ★4.84 · 2 be... 255905173 Melbourne -37.82815 144.96857 Hotel room 4 NaN 2.0 2.0 ["Microwave", "Carbon monoxide alarm", "Hot wa... $353.00 1 44
8641 https://www.airbnb.com/rooms/35914383 Serviced apartment in Williamstown · ★4.67 · 2... 270134235 Hobsons Bay -37.85641 144.87878 Hotel room 4 NaN 2.0 3.0 ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... $394.00 6 9
8696 https://www.airbnb.com/rooms/35717112 Home in Melton West · Studio · 1 bed 268666048 Melton -37.66411 144.56059 Private room 2 NaN NaN 1.0 ["Smoke alarm", "Wifi", "Kitchen", "Private li... $75.00 1 0
9051 https://www.airbnb.com/rooms/38076289 Aparthotel in Docklands · ★4.26 · 3 bedrooms ·... 16970146 Melbourne -37.81403 144.94222 Hotel room 6 NaN 3.0 4.0 ["Microwave", "Coffee maker", "High chair", "H... $429.00 2 42
9238 https://www.airbnb.com/rooms/38883439 Serviced apartment in Point Cook · ★4.17 · 2 b... 296379242 Wyndham -37.89366 144.76554 Hotel room 4 NaN 2.0 2.0 ["Security cameras on property", "Smoke alarm"... $269.00 1 6
10898 https://www.airbnb.com/rooms/43540050 Villa in Murrumbeena · ★4.91 · Studio · 1 bed 166664813 Glen Eira -37.90158 145.06935 Private room 2 NaN NaN 1.0 ["Security cameras on property", "Wifi", "Kitc... $40.00 7 23

All the 11 missing values in the "bathrooms_text" column correspond to properties with no specified number of bathrooms. Therefore, I've decided to fill these NaN values with '0 baths'.

In [21]:
# Fill all the 11 "bathrooms_text" missing values
data_v2.loc[data_v2['bathrooms_text'].isnull(), 'bathrooms_text'] = '0 baths'

Now, I'm going to check if that was treated.

In [22]:
# Check "bathrooms_text" missing values
data_v2['bathrooms_text'].isnull().sum()
Out[22]:
0

After treating the 'bathrooms_text' missing values, I will examine the 'bedrooms' column, which has been found to contain 6297 missing values. Here, I'm just previewing the first 10 rows.

In [23]:
# Check "bedrooms" column empty values
data_v2.loc[data_v2['bedrooms'].isnull()].head(10)
Out[23]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
0 https://www.airbnb.com/rooms/10803 Rental unit in Brunswick East · ★4.49 · 1 bedr... 38901 Moreland -37.766060 144.979510 Private room 2 1 shared bath NaN 1.0 ["Microwave", "Hot water kettle", "Drying rack... $49.00 5 173
3 https://www.airbnb.com/rooms/41836 Home in Reservoir · ★4.71 · 1 bedroom · 1 bed ... 182833 Darebin -37.697610 145.000660 Private room 2 1 shared bath NaN 1.0 ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... $40.00 7 159
7 https://www.airbnb.com/rooms/603007 Rental unit in South Yarra · ★4.76 · 1 bedroom... 2705870 Melbourne -37.830220 144.986820 Private room 2 1 shared bath NaN 1.0 ["Private living room", "Carbon monoxide alarm... $59.00 1 21
15 https://www.airbnb.com/rooms/628156 Condo in Port Melbourne · ★4.82 · 1 bedroom · ... 3119669 Port Phillip -37.841580 144.937940 Private room 2 1 private bath NaN 1.0 ["Stainless steel gas stove", "Microwave", "AC... $90.00 5 143
17 https://www.airbnb.com/rooms/44699 Condo in South Melbourne · ★4.75 · 1 bedroom ·... 189245 Port Phillip -37.831557 144.970644 Private room 2 1 shared bath NaN 1.0 ["Stainless steel gas stove", "Microwave", "St... $88.00 1 54
20 https://www.airbnb.com/rooms/1234561 Home in St Kilda · ★4.77 · 1 bedroom · 1 bed ·... 6731712 Port Phillip -37.858440 144.988940 Private room 1 1 bath NaN 1.0 ["Smoking allowed", "Wifi", "Kitchen", "Dedica... $39.00 2 23
27 https://www.airbnb.com/rooms/67211 Rental unit in Kew · ★4.82 · 1 bedroom · 1 bed... 326880 Boroondara -37.803470 145.035830 Private room 1 1 shared bath NaN 1.0 ["Microwave", "Free street parking", "Lockbox"... $50.00 2 225
30 https://www.airbnb.com/rooms/70328 Townhouse in Mordialloc · ★4.94 · 1 bedroom · ... 356006 Kingston -38.007300 145.085440 Private room 2 1 private bath NaN 1.0 ["Clothing storage: closet", "Microwave", "Hot... $65.00 1 147
32 https://www.airbnb.com/rooms/78143 Townhouse in Prahran · ★4.79 · 2 bedrooms · 2 ... 419767 Stonnington -37.851620 144.989430 Private room 2 1 private bath NaN 2.0 ["Microwave", "Hot water kettle", "Shared outd... $96.00 2 223
33 https://www.airbnb.com/rooms/671981 Home in Yarraville · 1 bedroom · 1 bed · 1 sha... 3395313 Maribyrnong -37.815070 144.888950 Shared room 1 1 shared bath NaN 1.0 ["TV"] $100.00 1 0

Here, I will create a new DataFrame 'bedrooms_info' that contains the split and cleaned information from the 'name' column where we have 'bedrooms' information. This information can then be used to fill the missing values in the 'bedrooms' column.

In [24]:
# Extract values from the column "name" to fill missing values in the "bedrooms" column
bedrooms_info = data_v2[data_v2['bedrooms'].isnull()]['name'].str.split('·', expand=True).applymap(lambda x: x.strip() if x is not None else x).copy()

We've managed to extract information about bedrooms from the 'name' column. We can observe that the second and third columns contain bedroom information. This can be used to handle missing values in the 'bedrooms' column.

In [25]:
# Preview the first 10 rows
bedrooms_info.head(10)
Out[25]:
0 1 2 3 4
0 Rental unit in Brunswick East ★4.49 1 bedroom 1 bed 1 shared bath
3 Home in Reservoir ★4.71 1 bedroom 1 bed 1 shared bath
7 Rental unit in South Yarra ★4.76 1 bedroom 1 bed 1 shared bath
15 Condo in Port Melbourne ★4.82 1 bedroom 1 bed 1 private bath
17 Condo in South Melbourne ★4.75 1 bedroom 1 bed 1 shared bath
20 Home in St Kilda ★4.77 1 bedroom 1 bed 1 bath
27 Rental unit in Kew ★4.82 1 bedroom 1 bed 1 shared bath
30 Townhouse in Mordialloc ★4.94 1 bedroom 1 bed 1 private bath
32 Townhouse in Prahran ★4.79 2 bedrooms 2 beds 1 private bath
33 Home in Yarraville 1 bedroom 1 bed 1 shared bath None

As mentioned above, we have bedroom information in the second and third columns. So, I will use the .value_counts() method to count the distribution of each value, in this case, in the second column.

In [26]:
# Check the values from the second column to extract the bedroom information
bedrooms_info[1].value_counts()
Out[26]:
1 bedroom     2121
★5.0           727
★New           336
2 bedrooms     226
Studio         200
★4.67          121
★4.75          101
★4.80           93
★4.83           87
★4.50           84
★4.86           80
★4.91           78
★4.94           76
★4.90           74
★4.95           72
★4.0            71
★4.89           70
★4.88           66
★4.93           65
★4.33           65
★4.92           62
★4.96           60
★4.78           59
★4.85           59
3 bedrooms      56
★4.97           50
★4.82           42
★4.71           40
★4.87           38
★4.60           38
★4.25           38
★4.79           36
★4.69           33
★4.77           32
★4.98           30
★4.81           30
★4.56           28
★4.84           26
★4.38           26
★4.64           25
★4.76           24
★4.72           23
★4.63           23
★4.73           22
4 bedrooms      21
★4.74           19
★4.70           19
★4.68           18
★4.58           17
★4.40           17
★4.65           17
★4.62           16
★4.57           16
★4.44           16
★4.55           15
★4.53           14
★4.54           14
★4.43           13
★3.67           13
★4.61           13
★4.47           13
★3.33           11
★3.50           11
★4.59           11
★4.20           11
★4.42           11
★4.36           11
★4.45           10
★4.46           10
★3.75            9
★4.27            8
★3.83            8
★4.17            7
★4.37            7
★4.29            7
★3.0             7
★4.66            7
★4.41            7
★4.52            7
★4.14            7
★3.60            6
★4.13            6
★4.31            6
★4.22            5
★3.80            5
★4.99            5
★4.09            5
★4.11            5
★4.30            5
★4.35            5
★4.39            4
★0.0             4
★4.26            4
★4.16            4
1 bed            4
★4.10            4
★3.86            3
6 bedrooms       3
★4.08            3
★4.49            3
★4.51            3
★4.24            3
★4.15            3
★4.28            3
5 bedrooms       3
★4.04            2
★3.89            2
★4.23            2
★4.21            2
★3.57            2
★4.34            2
★2.33            2
★4.19            2
★4.32            2
★4.06            2
★2.67            2
★2.0             1
★3.68            1
★3.20            1
★4.48            1
★3.79            1
★3.91            1
★2.80            1
3 beds           1
★3.85            1
★2.20            1
★3.92            1
7 bedrooms       1
★3.94            1
★2.75            1
★3.25            1
★4.07            1
★3.40            1
★3.93            1
2 beds           1
★3.90            1
Name: 1, dtype: int64

The results from the last cell show a variety of values. However, we're specifically interested in the number of bedrooms. To extract this information, I'll create a list with specific values.

In [27]:
# Create a list to extract the bedroom information from the second column in bedrooms_info
specific_values_1 = ['Studio', '1 bedroom', '2 bedrooms', '3 bedrooms', 
                    '4 bedrooms', '5 bedrooms', '6 bedrooms',
                    '7 bedrooms']

Here, I'll replicate the previous process to count the distribution of each value, this time focusing on the third column.

In [28]:
# Check the values from the third column to extract the bedroom information
bedrooms_info[2].value_counts()
Out[28]:
1 bedroom           2801
1 bed               2158
Studio               529
2 beds               309
2 bedrooms           215
3 beds                63
3 bedrooms            57
4 beds                33
1 shared bath         29
4 bedrooms            25
1 bath                15
1 private bath        15
5 beds                12
6 bedrooms             7
5 bedrooms             6
2 baths                3
1.5 shared baths       3
14 bedrooms            3
0 shared baths         2
1.5 baths              2
2 shared baths         2
2.5 shared baths       2
8 beds                 1
7 bedrooms             1
0 baths                1
6 beds                 1
7 beds                 1
Name: 2, dtype: int64

In this cell, I'm going to create a new list with specific values to extract bedrooms info from the third column.

In [29]:
# Create a list to extract the bedroom information from the third column in bedrooms_info
specific_values_2 = ['Studio', '1 bedroom', '2 bedrooms', '3 bedrooms', 
                    '4 bedrooms', '5 bedrooms', '6 bedrooms',
                    '7 bedrooms', '14 bedrooms']

After creating the list 'specific_values_1', I'll extract rows from the DataFrame bedrooms_info where the second column's value is in 'specific_values_1'. I'll use the .isin() function for this and save the result in bedrooms_info_1.

In [30]:
# Extract values from the second column
bedrooms_info_1 = bedrooms_info[bedrooms_info[1].isin(specific_values_1)]

And here, I'm going to do the same for the third column, storing the results in bedrooms_info_2. This is a common operation when you want to filter data based on certain criteria.

In [31]:
# Extract values from the third column
bedrooms_info_2 = bedrooms_info[bedrooms_info[2].isin(specific_values_2)]

With the bedroom information extracted, I will create a new DataFrame. This DataFrame will concatenate the data from the second and third columns.

In [32]:
# Concat bedroom information from the second and third columns
bedrooms_info_complete = pd.concat([bedrooms_info_1[1], bedrooms_info_2[2]],)

Now I will sort the DataFrame index using the .sort_index() method, passing as argument inplace equals True (which means that the sorting is done on the original DataFrame).

In [33]:
# Sort the index
bedrooms_info_complete.sort_index(inplace=True)

By applying the .value_counts() to the bedrooms_info_complete DataFrame, we can count the frequency of each unique value that was extracted.

In [34]:
# Check unique value
bedrooms_info_complete.value_counts()
Out[34]:
1 bedroom      4922
Studio          729
2 bedrooms      441
3 bedrooms      113
4 bedrooms       46
6 bedrooms       10
5 bedrooms        9
14 bedrooms       3
7 bedrooms        2
dtype: int64

The values extracted into the 'bedrooms_info_complete' are strings, but the values in the 'bedrooms' column are numeric. Therefore, I'm going to convert the values in 'bedrooms_info_complete' to numeric.

In [35]:
# Convert string values to numeric values
replace_values = {'Studio': 0.0, '1 bedroom': 1.0, '2 bedrooms': 2.0, '3 bedrooms': 3.0, '4 bedrooms': 4.0, '5 bedrooms': 5.0, '6 bedrooms': 6.0, '7 bedrooms': 7.0, '14 bedrooms': 14.0}
bedrooms_info_complete = bedrooms_info_complete.replace(replace_values)

Using the .value_counts() to check the converted values.

In [36]:
# Check unique values 
bedrooms_info_complete.value_counts()
Out[36]:
1.0     4922
0.0      729
2.0      441
3.0      113
4.0       46
6.0       10
5.0        9
14.0       3
7.0        2
dtype: int64

Before filling in missing values in 'bedrooms' column, I will create a backup of the DataFrame.

In [37]:
# Make another DataFrame backup
data_v3 = data_v2.copy()

After creating a backup, I will use the .fillna() method to fill in the missing values in the ‘bedrooms’ column with the extracted bedroom information.

In [38]:
# Fill the "bedrooms" missing values with bedroom information
data_v3['bedrooms'] = data_v3['bedrooms'].fillna(bedrooms_info_complete)

Upon inspecting the 'bedrooms' column in our DataFrame, we find that there are still 22 missing values. This requires a more detailed analysis.

In [39]:
# Check "bedrooms" missing values
data_v3['bedrooms'].isnull().sum()
Out[39]:
22

As we have just 22 missing values, I’m going to inspect each of them to try to find some patterns.

In [40]:
# Check the "bedrooms" 22 missing values
data_v3.loc[data_v3['bedrooms'].isnull()].head(22)
Out[40]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
1161 https://www.airbnb.com/rooms/6825713 Place to stay in Collingwood · ★4.82 · 2 beds ... 24685614 Yarra -37.79515 144.985340 Entire home/apt 3 1 bath NaN 2.0 ["Microwave", "Drying rack for clothing", "Car... $139.00 2 298
2327 https://www.airbnb.com/rooms/12293289 Castle in The Patch · ★4.97 · 1 bed · 1 bath 66261359 Yarra Ranges -37.88414 145.391480 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Hot water kettle", "BBQ grill",... $489.00 1 149
2628 https://www.airbnb.com/rooms/13303352 Farm stay in Chum Creek/Healesville · ★4.95 · ... 21470508 Yarra Ranges -37.61723 145.496610 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Induction stove", "Hot water ke... $141.00 1 393
4254 https://www.airbnb.com/rooms/19989578 Place to stay in Fitzroy · 2 beds · 1 shared bath 26178901 Yarra -37.79769 144.980440 Private room 4 1 shared bath NaN 2.0 ["Wifi", "Kitchen", "Washer", "Hangers", "Esse... $300.00 1 0
5314 https://www.airbnb.com/rooms/22488519 Place to stay in South Melbourne · 1 bed · 1 s... 526851 Port Phillip -37.83458 144.968890 Private room 1 1 shared bath NaN 1.0 ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... $50.00 1 1
5503 https://www.airbnb.com/rooms/22814463 Place to stay in Badger Creek · ★4.85 · 1 bed ... 107919473 Yarra Ranges -37.68255 145.557360 Entire home/apt 2 1 bath NaN 1.0 ["Kitchen", "Stove", "Patio or balcony", "Dish... $199.00 2 185
5612 https://www.airbnb.com/rooms/23069274 Tiny home in Clematis · ★4.90 · 1 bed · 1 bath 62077033 Cardinia -37.92948 145.410450 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Hot water kettle", "Drying rack... $130.00 1 263
6114 https://www.airbnb.com/rooms/24945953 Castle in The Patch · ★4.92 · 1 bed · 1 bath 66261359 Yarra Ranges -37.88362 145.390070 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Induction stove", "Hot water ke... $494.00 1 38
6523 https://www.airbnb.com/rooms/26888053 Nature lodge in Mt Dandenong · ★4.78 · 1 bed 156749664 Yarra Ranges -37.82976 145.352280 Hotel room 2 0 baths NaN 1.0 ["Clothing storage: closet", "Microwave", "Hot... $325.00 1 41
8331 https://www.airbnb.com/rooms/34515725 Tiny home in Beaumaris · 1 bed · 1 shared bath 29665549 Bayside -37.99194 145.041330 Private room 2 1 shared bath NaN 1.0 ["Smoke alarm", "Kitchen", "Hot tub", "Carbon ... $150.00 1 0
8534 https://www.airbnb.com/rooms/34977090 Farm stay in Yarra Junction · ★4.83 · 1 bed · ... 25918877 Yarra Ranges -37.78794 145.639140 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Hot water kettle", "BBQ grill",... $206.00 1 166
8712 https://www.airbnb.com/rooms/36046114 Tiny home in Sandringham · ★4.93 · 3 beds · 1 ... 229196502 Bayside -37.95524 145.021260 Entire home/apt 6 1 bath NaN 3.0 ["Integrated Fisher and Paykel refrigerator", ... $220.00 3 111
8980 https://www.airbnb.com/rooms/37500292 Tiny home in East Warburton · ★4.81 · 3 beds ·... 262886891 Yarra Ranges -37.73093 145.752710 Entire home/apt 3 1 bath NaN 3.0 ["BBQ grill", "Breakfast", "Barbecue utensils"... $165.00 2 34
9104 https://www.airbnb.com/rooms/38117602 Farm stay in Healesville · ★4.40 · 1 bed · 1.5... 100564438 Yarra Ranges -37.67870 145.515640 Private room 1 1.5 baths NaN 1.0 ["Smoking allowed", "Smoke alarm", "Wifi", "Pr... $110.00 300 6
9345 https://www.airbnb.com/rooms/38979142 Camper/RV in St Kilda · 1 bed · 0 baths 56324731 Port Phillip -37.86176 144.972330 Entire home/apt 3 0 baths NaN 1.0 ["Smoking allowed", "Kitchen", "First aid kit"... $35.00 120 0
9806 https://www.airbnb.com/rooms/40377799 Tent in Healesville · ★4.81 · 1 bed · 1 bath 18768614 Yarra Ranges -37.64708 145.534910 Entire home/apt 2 1 bath NaN 1.0 ["Smoke alarm", "Patio or balcony", "Dishes an... $200.00 1 194
9960 https://www.airbnb.com/rooms/40510306 Barn in Dixons Creek · ★4.95 · 1 bed · 1 bath 272422171 Yarra Ranges -37.61312 145.403040 Entire home/apt 2 1 bath NaN 1.0 ["Microwave", "Breakfast", "Coffee maker", "Fr... $150.00 2 65
10179 https://www.airbnb.com/rooms/41382311 Hut in Millgrove · 3 beds · 1 bath 15783446 Yarra Ranges -37.75091 145.648950 Entire home/apt 5 1 bath NaN 3.0 ["Kitchen", "Essentials", "Free parking on pre... $80.00 50 0
11188 https://www.airbnb.com/rooms/46096289 Tent in The Patch · ★4.82 · 1 bed · 1 bath 31068242 Yarra Ranges -37.88634 145.399630 Entire home/apt 2 1 bath NaN 1.0 ["Bluetooth sound system", "Outdoor shower", "... $185.00 1 130
11768 https://www.airbnb.com/rooms/48741282 Camper/RV in Lilydale · 1 bed · 1 bath 7104573 Yarra Ranges -37.75890 145.391040 Entire home/apt 2 1 bath NaN 1.0 ["Security cameras on property", "Smoke alarm"... $170.00 2 0
13040 https://www.airbnb.com/rooms/53895756 Place to stay in Alphington · ★4.92 · 2 beds ·... 238794429 Darebin -37.77995 145.031630 Entire home/apt 3 1 bath NaN 2.0 ["AC - split type ductless system", "Lockbox",... $137.00 2 87
16833 https://www.airbnb.com/rooms/767566122456779194 Tiny home in Healesville · ★4.97 · 1 bed · 1 bath 50611336 Yarra Ranges -37.65515 145.513968 Entire home/apt 2 1 bath NaN 1.0 ["Wifi", "Kitchen", "BBQ grill", "Refrigerator... $179.00 1 30

After inspecting all 22 missing values in the 'bedrooms' column, it was observed that they all correspond to 'Studio'. Therefore, these mising values will be replaced with 0.0, which represents 'Studio'.

In [41]:
# Fill in the missing values in the 'bedrooms' column with 0.0, which represents 'Studio'
data_v3[data_v3['bedrooms'].isnull()] = data_v3[data_v3['bedrooms'].isnull()].fillna(0.0)

Missing values in the 'bedrooms' column have been successfully handled. The next step is to investigate and address the 183 missing values in the 'beds' column.

In [42]:
# Check for missing values
data_v3.isnull().sum()
Out[42]:
listing_url                 0
name                        0
host_id                     0
neighbourhood_cleansed      0
latitude                    0
longitude                   0
room_type                   0
accommodates                0
bathrooms_text              0
bedrooms                    0
beds                      183
amenities                   0
price                       0
minimum_nights              0
number_of_reviews           0
dtype: int64

Let's display the first 5 rows in the 'beds' column with NaN values.

In [43]:
# Check "beds" column missing values
data_v3[data_v3['beds'].isnull()].head()
Out[43]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
120 https://www.airbnb.com/rooms/268620 Home in Caroline Springs · 1 bedroom 1407948 Melton -37.73975 144.73742 Private room 1 0 baths 1.0 NaN [] $72.00 1 0
211 https://www.airbnb.com/rooms/508590 Home in Balwyn North · ★4.63 · 1 bedroom 2255480 Boroondara -37.79370 145.07935 Private room 1 0 baths 1.0 NaN [] $50.00 1 16
212 https://www.airbnb.com/rooms/1742469 Home in Caulfield South · ★4.92 · 1 bedroom · ... 9173051 Glen Eira -37.89613 145.01357 Private room 1 1 private bath 1.0 NaN ["Microwave", "Fast wifi \u2013 56 Mbps", "Hot... $83.00 2 62
245 https://www.airbnb.com/rooms/1815508 Rental unit in Richmond · ★4.83 · 1 bedroom · ... 9509273 Yarra -37.82210 144.99922 Entire home/apt 2 1 bath 1.0 NaN ["Microwave", "Free street parking", "Lockbox"... $141.00 22 413
642 https://www.airbnb.com/rooms/4372398 Rental unit in St Kilda · ★4.66 · 1 bedroom · ... 22700287 Port Phillip -37.86735 144.97775 Entire home/apt 2 1 bath 1.0 NaN ["Microwave", "Hot water kettle", "Books and r... $125.00 3 74

Upon analyzing the data, a common pattern emerged: listings with either 1 bedroom or a studio, accommodating no more than 2 guests, typically have 1 bed. As such, the missing 'beds' values in these listings will be filled with the mode, which is 1.

In [44]:
# Impute missing 'beds' values for listings accommodating up to 2 guests in 1 bedroom or less with the mode of 'beds', which is 1
data_v3.loc[(data_v3['beds'].isnull()) & (data_v3['accommodates'] <= 2) & (data_v3['bedrooms'] <= 1), 'beds'] = st.mode(data_v3['beds'])

We still have 31 missings values in the 'beds' column.

In [45]:
# Check for missing values
data_v3.isnull().sum()
Out[45]:
listing_url                0
name                       0
host_id                    0
neighbourhood_cleansed     0
latitude                   0
longitude                  0
room_type                  0
accommodates               0
bathrooms_text             0
bedrooms                   0
beds                      31
amenities                  0
price                      0
minimum_nights             0
number_of_reviews          0
dtype: int64

After dealing with missing values based on a specific condition, the 'listing_url' column will be used to tackle the remaining missing data.

In [46]:
# Analyze the missing 'beds' values and checking the 'listing_url' column as a reference to handle the missing values
data_v3[data_v3['beds'].isnull()].head(10)
Out[46]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
783 https://www.airbnb.com/rooms/5573248 Home in Richmond · ★4.89 · 2 bedrooms · 1.5 baths 6518033 Yarra -37.81748 144.99520 Entire home/apt 4 1.5 baths 2.0 NaN ["Stainless steel gas stove", "Microwave", "St... $206.00 1 264
845 https://www.airbnb.com/rooms/5096866 Home in Coburg North · ★4.71 · 1 bedroom · 1 s... 26336840 Moreland -37.72044 144.96969 Private room 3 1 shared bath 1.0 NaN ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... $42.00 5 43
2430 https://www.airbnb.com/rooms/12633686 Townhouse in Mulgrave · 4 bedrooms · 2.5 baths 40094786 Monash -37.92501 145.18905 Entire home/apt 6 2.5 baths 4.0 NaN ["Microwave", "Hot water kettle", "BBQ grill",... $380.00 15 0
2780 https://www.airbnb.com/rooms/13592574 Rental unit in Prahran · ★4.94 · 1 bedroom · 1... 78547489 Stonnington -37.85131 144.98881 Entire home/apt 4 1 bath 1.0 NaN ["Stainless steel gas stove", "Microwave", "Ho... $160.00 4 35
2895 https://www.airbnb.com/rooms/14094124 Townhouse in Brighton · ★5.0 · 2 bedrooms · 2 ... 84806429 Bayside -37.88982 145.00013 Entire home/apt 4 2 baths 2.0 NaN ["Wifi", "Kitchen", "Dryer", "Iron", "First ai... $175.00 3 3
3419 https://www.airbnb.com/rooms/16440438 Cottage in Hoddles Creek · ★5.0 · 2 bedrooms ·... 88051903 Yarra Ranges -37.81570 145.56603 Entire home/apt 4 1 bath 2.0 NaN ["Microwave", "Hot water kettle", "AC - split ... $330.00 2 53
3478 https://www.airbnb.com/rooms/16233947 Home in Clifton Hill · ★5.0 · 3 bedrooms · 2 b... 33071051 Yarra -37.78820 145.00014 Entire home/apt 4 2 baths 3.0 NaN ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Dr... $350.00 7 12
5786 https://www.airbnb.com/rooms/23931198 Villa in Carrum · ★4.83 · 3 bedrooms · 1 share... 131401047 Kingston -38.08143 145.14034 Private room 1 1 shared bath 3.0 NaN ["Smoke alarm", "Wifi", "Kitchen", "Bed linens... $38.00 3 6
8229 https://www.airbnb.com/rooms/33601421 Home in Kew · ★5.0 · 2 bedrooms · 1.5 shared b... 50264189 Boroondara -37.79356 145.03792 Private room 2 1.5 shared baths 2.0 NaN ["Microwave", "BBQ grill", "Private living roo... $69.00 1 3
9302 https://www.airbnb.com/rooms/38472816 Condo in Southbank · ★4.53 · 1 bedroom · 1 bath 155404054 Melbourne -37.82225 144.96119 Entire home/apt 4 1 bath 1.0 NaN ["Microwave", "Bluetooth sound system", "Harbo... $122.00 1 74

Out of 31 missing values, each was checked against a specifc 'listing_url'. I identified 26 corresponding values that will be used to fill the NaNs.

In [47]:
# Fill the 'beds' missing values
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/5573248', 'beds'] = 4.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/5096866', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/13592574', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/14094124', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/16440438', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/16233947', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/23931198', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/33601421', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/38472816', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/39319156', 'beds'] = 9.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/39454193', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/40165294', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/40603921', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/42317481', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/42318082', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/43293837', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/43117636', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/46879468', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/49820299', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/582835905208247854', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/720080948267459603', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/799591397619258644', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/798136718669520279', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/840129849005783804', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/939039430629377525', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/962738485338707503', 'beds'] = 2.0

For the remaining 5 values, sufficient information was not found, so I'm going to drop them and reset the index using .reset_index() method.

In [48]:
# Drop 5 rows from the 'beds' column
data_v3 = data_v3.dropna(subset=['beds']).reset_index(drop=True)

Let's check the missing values.

In [49]:
# Check for missing values
round(data_v3.isnull().sum())
Out[49]:
listing_url               0
name                      0
host_id                   0
neighbourhood_cleansed    0
latitude                  0
longitude                 0
room_type                 0
accommodates              0
bathrooms_text            0
bedrooms                  0
beds                      0
amenities                 0
price                     0
minimum_nights            0
number_of_reviews         0
dtype: int64

All missing values have now been handled.

In [50]:
# Check the percentage of missing values
round(data_v3.isnull().sum() / len(data_v3) * 100,2)
Out[50]:
listing_url               0.0
name                      0.0
host_id                   0.0
neighbourhood_cleansed    0.0
latitude                  0.0
longitude                 0.0
room_type                 0.0
accommodates              0.0
bathrooms_text            0.0
bedrooms                  0.0
beds                      0.0
amenities                 0.0
price                     0.0
minimum_nights            0.0
number_of_reviews         0.0
dtype: float64

Data Type Conversion¶

Ensuring that your data is in the correct format before beginning the analysis is crucial as it helps prevent errors and enhances the accuracy and efficiency of your analysis. Therefore, I will use the .dtypes attribute to verify the data types of the columns. Upon inspection, we observe that the ‘bedrooms’, ‘beds’, and ‘price’ columns need formatting.

In [51]:
# Check data type
data_v3.dtypes
Out[51]:
listing_url                object
name                       object
host_id                     int64
neighbourhood_cleansed     object
latitude                  float64
longitude                 float64
room_type                  object
accommodates                int64
bathrooms_text             object
bedrooms                  float64
beds                      float64
amenities                  object
price                      object
minimum_nights              int64
number_of_reviews           int64
dtype: object

The ‘price’ column is currently an object data type, but it needs to be numeric. To achieve this, I will use a regular expression to replace any occurrence of a dollar sign ($) or comma (,) in the ‘price’ column with an empty string (''), effectively removing these characters. The .str attribute allows string methods to be applied to the Series object. Finally, I will use pd.to_numeric() to convert the column into a numeric data type.

In [52]:
# Convert the column 'price' to numeric
data_v3['price'] = pd.to_numeric(data_v3['price'].str.replace(r'[\$,]', '', regex=True))

Checking the 'price' column data type.

In [53]:
# Check 'price' data type
data_v3['price'].dtypes
Out[53]:
dtype('float64')

The ‘bedrooms’ and ‘beds’ columns in the DataFrame are currently of float data type, which I plan to convert to integers. To do this, I will use the .astype() method with ‘int64’ as the argument, indicating that the values should be converted to 64-bit integers.

In [54]:
# Convert the columns 'bedrooms' and 'beds' to integers
data_v3[['bedrooms', 'beds']] = data_v3[['bedrooms', 'beds']].astype('int64')

Checking the 'bedrooms' and 'beds' column data type.

In [55]:
# Check 'bedrooms' and 'beds' data type
data_v3[['bedrooms', 'beds']].dtypes
Out[55]:
bedrooms    int64
beds        int64
dtype: object

Handling Inconsistent Data¶

In this step, we aim to identify and correct any out-of-place or incorrect data in the DataFrame. I used the .value_counts() method to examine the values in the ‘neighbourhood_cleansed’ column. I found an outdated local government area name, ‘Moreland’, which has been renamed to ‘Merri-bek’.

In [56]:
# Check unique values in 'neighbourhood_cleansed' column
data_v3['neighbourhood_cleansed'].value_counts()
Out[56]:
Melbourne            7279
Port Phillip         2742
Yarra                1776
Stonnington          1514
Yarra Ranges         1112
Moreland              942
Monash                590
Boroondara            589
Whitehorse            583
Wyndham               576
Glen Eira             558
Darebin               551
Maribyrnong           515
Bayside               401
Kingston              363
Moonee Valley         353
Hobsons Bay           267
Frankston             263
Brimbank              258
Casey                 234
Manningham            227
Hume                  207
Banyule               201
Cardinia              186
Knox                  172
Greater Dandenong     172
Whittlesea            172
Melton                157
Maroondah             117
Nillumbik             103
Name: neighbourhood_cleansed, dtype: int64

To handle that, I'm going to filter the data where 'neighbourhood_cleansed' values are equal to 'Moreland', and finally replace them with 'Merri-bek'.

In [57]:
# Replace 'Moreland' values in the 'neighbourhood_cleansed' column with 'Merri-bek'
data_v3.loc[data_v3['neighbourhood_cleansed'] == 'Moreland', 'neighbourhood_cleansed'] = 'Merri-bek'

Duplicate data¶

Looking for duplicates, I used the .duplicated().sum() method, but none were found.

In [58]:
# Check duplicate data
data_v3.duplicated().sum()
Out[58]:
0

Outliers¶

Outliers are data points in a DataFrame that significantly differ from others. By using the .describe() method to inspect quantitative data, I noticed some extreme values in the ‘beds’, ‘price’, and ‘minimum_nights’ columns. Therefore, I plan to examine these columns using a boxplot.

In [59]:
# Check quantitative data
data_v3.describe()
Out[59]:
host_id latitude longitude accommodates bedrooms beds price minimum_nights number_of_reviews
count 2.318000e+04 23180.000000 23180.000000 23180.000000 23180.000000 23180.000000 23180.000000 23180.000000 23180.000000
mean 1.607798e+08 -37.828200 145.013867 3.659232 1.717990 2.120449 230.782442 5.305824 30.810224
std 1.585869e+08 0.074548 0.150693 2.357892 1.073017 1.755538 1333.728563 29.119214 60.811455
min 7.834000e+03 -38.225690 144.518047 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
25% 3.038696e+07 -37.854085 144.956621 2.000000 1.000000 1.000000 95.000000 1.000000 1.000000
50% 9.884258e+07 -37.819435 144.977230 3.000000 1.000000 2.000000 149.000000 2.000000 7.000000
75% 2.628817e+08 -37.802090 145.022482 4.000000 2.000000 3.000000 226.000000 3.000000 32.000000
max 5.351276e+08 -37.484800 145.844997 16.000000 14.000000 117.000000 104983.000000 1125.000000 987.000000

Checking 'beds' column¶

To create a boxplot, I used the box function from the plotly.express module (which is often imported as px), and stored that in the variable fig. Then, I used the .update_layout() function to customize the layout of the plot. Displaying the plot through .show() function, we can observe the presence of extreme values in the 'beds' column.

In [60]:
# Create the graph 
fig = px.box(data_v3, x='beds')
fig.update_layout(
    title='Boxplot analysis variable - beds',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500)

# Show the boxplot
fig.show()

Observing the presence of extreme values in the boxplot above, caution is necessary as these extreme values may not necessarily be outliers. For further investigation, I will use the 'listing_url' column as a reference.

In [61]:
# Inspect the 'beds' column where values are higher or equal to 22 using the 'listing_url' to check the 'beds' values
data_v3[data_v3['beds'] >= 22].head(3)
Out[61]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
11279 https://www.airbnb.com/rooms/46266224 Vacation home in Yarra Glen · 10 bedrooms · 22... 107232439 Yarra Ranges -37.622600 145.372260 Entire home/apt 16 9 baths 10 22 ["Microwave", "BBQ grill", "Barbecue utensils"... 2400.0 2 2
15169 https://www.airbnb.com/rooms/690072715126723461 casa particular in Melbourne · 3 bedrooms · 11... 457710697 Melbourne -37.815070 144.962610 Entire home/apt 2 2.5 baths 3 117 ["Smoking allowed", "Window AC unit", "Pets al... 19045.0 1 0
19381 https://www.airbnb.com/rooms/857121112148324865 Home in Deer Park · 3 bedrooms · 31 beds · 2 b... 485621814 Brimbank -37.761819 144.772296 Entire home/apt 16 2 baths 3 31 ["Microwave", "Hot water kettle", "BBQ grill",... 169.0 1 1

Using the .shape attribute revealed 463 extreme values. Hovewer, a detailed examination identified only one true outlier. Without this through investigation, we could have erroneously removed 462 valid rows.

In [62]:
# Filter 'beds' column with values above or equal to 7 and counting them
data_v3[data_v3['beds'] >= 7].shape
Out[62]:
(463, 15)

Now, I will remove the data point in the 'beds' column that indicates 117 beds, as it's an outlier.

In [63]:
# Remove the outlier from the 'beds' column and reseting the DataFrame index
data_v3 = data_v3[data_v3['beds'] != 117].reset_index(drop=True)

As we can see from our boxplot, there are still extreme values. However, upon closer examination, these extreme values do not qualify as outliers, as I previously mentioned.

In [64]:
# Create the graph
fig = px.box(data_v3, x='beds')
fig.update_layout(
    title='Boxplot analysis variable - beds',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500)

# Show the boxplot
fig.show()

Checking 'price' column¶

Checking quantitative data, we found values equal to 0 in 'price' column and that does not make sense. Therefore, I will check that.

In [65]:
# Check quantitative data
data_v3.describe()
Out[65]:
host_id latitude longitude accommodates bedrooms beds price minimum_nights number_of_reviews
count 2.317900e+04 23179.000000 23179.000000 23179.000000 23179.000000 23179.000000 23179.000000 23179.000000 23179.000000
mean 1.607670e+08 -37.828201 145.013869 3.659304 1.717934 2.115492 229.970749 5.306010 30.811554
std 1.585783e+08 0.074549 0.150696 2.357918 1.073007 1.585129 1328.019560 29.119828 60.812430
min 7.834000e+03 -38.225690 144.518047 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
25% 3.038696e+07 -37.854090 144.956621 2.000000 1.000000 1.000000 95.000000 1.000000 1.000000
50% 9.883592e+07 -37.819440 144.977230 3.000000 1.000000 2.000000 149.000000 2.000000 7.000000
75% 2.628817e+08 -37.802090 145.022485 4.000000 2.000000 3.000000 226.000000 3.000000 32.000000
max 5.351276e+08 -37.484800 145.844997 16.000000 14.000000 31.000000 104983.000000 1125.000000 987.000000

In this column, I will filter the data to investigate the rows trying to uncover some relevant information.

In [66]:
# Filter 'price' column with values below $30
data_v3[(data_v3['price'] < 30)].head(15)
Out[66]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
367 https://www.airbnb.com/rooms/3094430 Home in Ormond · 1 bedroom · 1 bed · 1 bath 15736128 Glen Eira -37.90021 145.03742 Private room 1 1 bath 1 1 ["Wifi", "Kitchen", "Washer", "First aid kit",... 26.0 1 0
413 https://www.airbnb.com/rooms/3194833 Rental unit in Southbank · 1 bedroom · 2 beds ... 1424387 Port Phillip -37.82871 144.95737 Shared room 2 1 shared bath 1 2 ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Ho... 24.0 1 1
727 https://www.airbnb.com/rooms/4786567 Rental unit in South Yarra · ★4.67 · 1 bedroom... 24674881 Melbourne -37.83538 144.98543 Private room 1 1 bath 1 1 ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "E... 25.0 7 3
761 https://www.airbnb.com/rooms/5196283 Rental unit in Flemington · ★4.60 · 1 bedroom ... 19752824 Melbourne -37.78868 144.92314 Private room 3 1 bath 1 2 ["Smoke alarm", "Wifi", "Dryer", "Pets allowed... 20.0 3 5
1091 https://www.airbnb.com/rooms/7207612 Home in Northcote · 1 bedroom · 1 bed · 1 shar... 37741609 Darebin -37.76533 145.00940 Private room 2 1 shared bath 1 1 ["Wifi", "Smoke alarm", "Kitchen", "Essentials... 25.0 12 0
1118 https://www.airbnb.com/rooms/7253120 Home in Richmond · 1 bedroom · 1 bed · 1 bath 27460296 Yarra -37.81993 144.99620 Private room 2 1 bath 1 1 ["Smoke alarm", "Wifi", "Kitchen", "Washer", "... 28.0 3 2
1182 https://www.airbnb.com/rooms/6855196 Home in Camberwell · 1 bedroom · 1 bed · 1 bath 35925193 Boroondara -37.84727 145.07037 Private room 2 1 bath 1 1 ["Smoking allowed", "Wifi", "Kitchen", "Washer... 13.0 1 0
1228 https://www.airbnb.com/rooms/7440879 Home in Coburg · ★4.88 · 1 bedroom · 1 bed · 1... 33559438 Merri-bek -37.73810 144.97693 Shared room 1 1.5 shared baths 1 1 ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Wa... 25.0 1 8
1300 https://www.airbnb.com/rooms/7060419 Home in Camberwell · 1 bedroom · 1 bed · 1 bath 37024203 Boroondara -37.84818 145.09497 Private room 2 1 bath 1 1 ["Wifi", "Smoke alarm", "Kitchen", "Essentials... 20.0 1 0
1434 https://www.airbnb.com/rooms/8905856 Home in Clifton Hill · 1 bedroom · 1 bed · 1 bath 46582678 Yarra -37.78931 144.99854 Private room 2 1 bath 1 1 ["Wifi", "Kitchen", "Smoke alarm", "TV", "Free... 26.0 1 0
1468 https://www.airbnb.com/rooms/8375432 Home in Carlton North · 1 bedroom · 1 bed · 1 ... 44142350 Yarra -37.78067 144.97007 Private room 1 1 bath 1 1 ["Wifi", "Smoke alarm", "Kitchen", "Breakfast"... 22.0 12 0
1476 https://www.airbnb.com/rooms/8394057 Home in Highett · 1 bedroom · 1 bed · 1.5 baths 44229140 Bayside -37.94779 145.03006 Private room 1 1.5 baths 1 1 ["Wifi", "Smoke alarm", "Kitchen", "Washer", "... 25.0 1 1
1539 https://www.airbnb.com/rooms/8656155 Rental unit in Thornbury · 1 bedroom · 1 bed ·... 45489433 Darebin -37.75685 145.01064 Private room 1 1 bath 1 1 ["Wifi", "Kitchen", "Washer", "Essentials", "F... 27.0 7 0
1822 https://www.airbnb.com/rooms/10000335 Place to stay in Clayton · 2 bedrooms · 1 bed ... 51373610 Monash -37.91737 145.13953 Private room 1 3 shared baths 2 1 ["Smoke alarm", "Wifi", "Kitchen", "Gym", "Iro... 26.0 3 0
1952 https://www.airbnb.com/rooms/10572460 Home in Altona Meadows · 1 bedroom · 1 bed · 1... 51890204 Hobsons Bay -37.87756 144.76848 Private room 2 1 bath 1 1 ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... 26.0 3 2

Upon examining the 'price' column, we noticed that listings priced below $30 showed an update of price information. As we have lots of rows, I decided to implement the strategy to replace these values by the median using the .median() method.

In [67]:
# Replace the values below 30 in the 'price' column with the median of the 'price' column
data_v3.loc[data_v3['price'] < 30, 'price'] = data_v3['price'].median()

Now I'm going to look at the higher prices. To handle with that I will use the boxplot.

In [68]:
# Create the graph 
fig = px.box(data_v3, x='price')
fig.update_layout(
    title='Boxplot analysis variable - price',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500)

# Show the boxplot
fig.show()

Through the boxplot we can see some higher values. So, I will analyze prices higher than $2000 to check if I'm going to consider them, otherwise, I'm going to remove them.

In [69]:
# Filter prices higher than $2000
data_v3[(data_v3['price'] > 2000)].head(10)
Out[69]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
78 https://www.airbnb.com/rooms/1353588 Townhouse in Richmond · ★4.79 · Studio · 2 bed... 6362922 Yarra -37.82157 144.99694 Private room 1 0 shared baths 0 2 ["Wifi", "Kitchen", "Iron", "Hangers", "Breakf... 20000.0 2 95
705 https://www.airbnb.com/rooms/4767182 Home in Malvern · 5 bedrooms · 5 beds · 4 baths 11914644 Stonnington -37.84542 145.03280 Entire home/apt 10 4 baths 5 5 ["Microwave", "Coffee maker", "Host greets you... 3304.0 1 0
1246 https://www.airbnb.com/rooms/7490588 Rental unit in Melbourne · ★4.49 · 2 bedrooms ... 27519695 Melbourne -37.81474 144.95128 Entire home/apt 8 1 bath 2 2 ["Microwave", "Carbon monoxide alarm", "Lockbo... 6825.0 400 400
1885 https://www.airbnb.com/rooms/10476673 Home in Heatherton · 3 bedrooms · 3 beds · 2.5... 31713704 Kingston -37.95436 145.08372 Entire home/apt 6 2.5 baths 3 3 ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Dr... 2243.0 7 0
2347 https://www.airbnb.com/rooms/12100235 Home in Healesville · ★4.90 · 10 bedrooms · 13... 64874108 Yarra Ranges -37.65557 145.51727 Entire home/apt 16 9.5 baths 10 13 ["Microwave", "Hot water kettle", "BBQ grill",... 2127.0 2 49
2461 https://www.airbnb.com/rooms/12719024 Home in Reservoir · 2 bedrooms · 2 beds · 1 bath 69151339 Darebin -37.72707 144.99738 Entire home/apt 4 1 bath 2 2 ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... 12624.0 10 1
2465 https://www.airbnb.com/rooms/12736588 Home in Thornbury · 1 bedroom · 1 bed · 1.5 baths 47493458 Darebin -37.75935 145.00777 Private room 2 1.5 baths 1 1 ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... 2500.0 3 0
2999 https://www.airbnb.com/rooms/14919472 Rental unit in Southbank · ★5.0 · 4 bedrooms ·... 29183827 Melbourne -37.82225 144.96365 Entire home/apt 6 3.5 baths 4 5 ["Smoke alarm", "Wifi", "Kitchen", "Iron", "Ca... 2057.0 7 4
3491 https://www.airbnb.com/rooms/16260007 Home in Reservoir · ★4.0 · 1 bedroom · 1 bed ·... 96127123 Darebin -37.69922 145.01132 Private room 2 1 bath 1 1 ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... 13379.0 1 3
3777 https://www.airbnb.com/rooms/18043096 Rental unit in Saint Kilda · ★4.58 · 2 bedroom... 105279173 Port Phillip -37.86593 144.98351 Private room 2 1 shared bath 2 1 ["Smoke alarm", "Wifi", "Kitchen", "Bed linens... 5500.0 14 12

After analyzing the 'price' column, I found that the marjority of listings priced above $2000 lack information, appear to be duplicates, have no reviews, are outdated, and are not availaible for check in. Therefore, I'm considering removing all listings priced above $2000.

In [70]:
# Remove values above to 2000
data_v3 = data_v3[data_v3['price'] <= 2000].reset_index(drop=True)

The boxplot shows many extreme values. However, these were not considered outliers because they represent valid data points.

In [71]:
# Create the graph 
fig = px.box(data_v3, x='price')
fig.update_layout(
    title='Boxplot analysis variable - price',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500
)

# Show the boxplot
fig.show()

Checking 'minimum_nights' column¶

Using the .describe(), I have observed an unusually high value in the 'minimum_nights' column. Let's investigate that.

In [72]:
# Check quantitative data
data_v3.describe()
Out[72]:
host_id latitude longitude accommodates bedrooms beds price minimum_nights number_of_reviews
count 2.309400e+04 23094.000000 23094.000000 23094.000000 23094.000000 23094.000000 23094.000000 23094.000000 23094.000000
mean 1.608007e+08 -37.828205 145.013783 3.649952 1.712999 2.109725 197.130423 5.235472 30.866372
std 1.586302e+08 0.074545 0.150722 2.346839 1.064941 1.572811 187.706796 28.747963 60.825704
min 7.834000e+03 -38.225690 144.518047 1.000000 0.000000 1.000000 30.000000 1.000000 0.000000
25% 3.039534e+07 -37.854127 144.956592 2.000000 1.000000 1.000000 99.000000 1.000000 1.000000
50% 9.881559e+07 -37.819410 144.977188 3.000000 1.000000 2.000000 149.000000 2.000000 7.000000
75% 2.628817e+08 -37.802066 145.022396 4.000000 2.000000 3.000000 225.000000 3.000000 32.000000
max 5.351276e+08 -37.484800 145.844997 16.000000 14.000000 31.000000 2000.000000 1125.000000 987.000000

Investigating through the boxplot we find some extreme values.

In [73]:
# Create the graph
fig = px.box(data_v3, x='minimum_nights')
fig.update_layout(
    title='Boxplot analysis variable - minimum_nights',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500
)

# Show the boxplot
fig.show()

Here, I'm going to filter the data where minimum_nights are more than 365.

In [74]:
# Filter 'minimum_nights' column with values above 365
data_v3[data_v3['minimum_nights'] > 365].head(10)
Out[74]:
listing_url name host_id neighbourhood_cleansed latitude longitude room_type accommodates bathrooms_text bedrooms beds amenities price minimum_nights number_of_reviews
110 https://www.airbnb.com/rooms/900838 Loft in Richmond · ★4.88 · 1 bedroom · 1 bed ·... 4812182 Yarra -37.82286 144.99065 Entire home/apt 2 1 bath 1 1 ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... 232.0 803 141
590 https://www.airbnb.com/rooms/4632666 Rental unit in Melbourne · ★4.62 · 2 bedrooms ... 4924814 Melbourne -37.81572 144.95447 Entire home/apt 5 1 bath 2 3 ["Microwave", "BBQ grill", "Free street parkin... 149.0 700 165
650 https://www.airbnb.com/rooms/4844270 Rental unit in South Yarra · 1 bedroom · 1 bed... 21300152 Melbourne -37.82967 144.98577 Entire home/apt 2 1 bath 1 1 ["Wifi", "Kitchen", "Iron", "Hangers", "Essent... 125.0 500 1
1974 https://www.airbnb.com/rooms/10770086 Rental unit in Travancore · 1 bedroom · 1 bed ... 54793045 Melbourne -37.78452 144.93970 Private room 2 1 bath 1 1 ["Smoke alarm", "Wifi", "Kitchen", "Gym", "Iro... 34.0 900 1
2922 https://www.airbnb.com/rooms/14833595 Rental unit in Melbourne · ★4.69 · 3 bedrooms ... 4924814 Melbourne -37.81624 144.95495 Entire home/apt 9 2 baths 3 5 ["Microwave", "Free street parking", "High cha... 269.0 600 125
2950 https://www.airbnb.com/rooms/14581093 Rental unit in Southbank · ★4.78 · 2 bedrooms ... 4924814 Melbourne -37.82821 144.95896 Entire home/apt 6 2 baths 2 3 ["Free street parking", "High chair", "Hot wat... 269.0 1124 75
2971 https://www.airbnb.com/rooms/14893028 Rental unit in Melbourne · ★4.76 · 3 bedrooms ... 4924814 Melbourne -37.81532 144.95331 Entire home/apt 8 2 baths 3 5 ["Microwave", "Free street parking", "Lockbox"... 299.0 600 137
4570 https://www.airbnb.com/rooms/21035543 Home in Wantirna · 1 bedroom · 1 bed · 1 priva... 42784943 Knox -37.84083 145.22844 Private room 2 1 private bath 1 1 ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... 40.0 1000 1
4669 https://www.airbnb.com/rooms/21194478 Rental unit in Southbank · ★4.70 · 3 bedrooms ... 4924814 Melbourne -37.82727 144.96079 Entire home/apt 9 2 baths 3 5 ["Microwave", "Free street parking", "High cha... 209.0 700 55
4968 https://www.airbnb.com/rooms/21792086 Rental unit in Melbourne · ★4.30 · 1 bedroom ·... 156159929 Melbourne -37.80936 144.96001 Private room 2 1 shared bath 1 1 ["Pool", "Wifi", "Kitchen", "Hangers", "Essent... 55.0 1000 52

After inspecting 'minimum_nights' values above 365, I discovered that most are outdated, some appears to be duplicates, and others are not available for check in. There's only one value that appears valid but with outdated information. I will update this value and remove the rest.

In [75]:
# Update the 'minimum_nights' value equal to 999 with 2, removing the values above 365 and reseting the DataFrame index
data_v3.loc[(data_v3['minimum_nights'] == 999), 'minimum_nights'] = 2
data_v3 = data_v3.loc[(data_v3['minimum_nights'] <= 365)].reset_index(drop=True)

Once again, the boxplot reveals a significant number of extreme values. Despite their extremity, these were not categorized as outliers.

In [76]:
# Create the graph
fig = px.box(data_v3, x='minimum_nights')
fig.update_layout(
    title='Boxplot analysis variable - minimum_nights',
    title_x=0.5,
    autosize=False,
    width=800,
    height=500
)

# Show the boxplot
fig.show()

As the 'name' and 'listing_url' columns don't provide more significant information, we'll exclude them using the .drop() method with the 'columns' argument, specifying the columns to be removed.

In [77]:
# Drop the 'name' and 'listing_url' columns
data_v3 = data_v3.drop(columns=['name', 'listing_url'])

Now I'm going to make another DataFrame backup.

In [78]:
# Make DataFrame backup
data_v4 = data_v3.copy()

I will export the DataFrame to a CSV file using the .to_csv() method. It takes two arguments: the path where the file will be saved and a boolean value to avoid writing row indices. I will use this CSV file to create a Tableau dashboard.

In [79]:
# Export DataFrame
data_v4.to_csv('listings_cleaned.csv', index=False)

5. Exploratory Data Analysis¶

In EDA (Exploratory Data Analysis), I'm going to summarize, visualize, and interpret data, which can help in identifying patterns and trends, so that we can find insights that answer our questions. In this stage, I'm going to perform Univariate, Bivariate and Correlation Analysis.

Univariate Analysis¶

Univariate Analysis involves studying a single data variable. This analysis can help us understand the data's characteristics, such as the distribution, central tendency, and dispersion. It's often the first step in data analysis, providing valuable insights before moving on to bivariate analysis. So, let's answer our questions!

1. Who are the top 10 hosts with the most listings?¶

In [80]:
# Create a frequency table for the top 10 hosts
top_10_hosts = pd.DataFrame({
    'Host ID': data_v4['host_id'].value_counts().head(10).index,
    'Number of Listings': data_v4['host_id'].value_counts().head(10).values,
    'Percentage of Listings': np.round((data_v4['host_id'].value_counts().head(10) / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
top_10_hosts.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
top_10_hosts.index = top_10_hosts.index + 1

# Display the frequency table
top_10_hosts
Out[80]:
Host ID Number of Listings Percentage of Listings
1 90729398 327 1.42
2 279001183 250 1.08
3 343442154 154 0.67
4 1739996 98 0.42
5 412803321 80 0.35
6 22860147 80 0.35
7 446080599 77 0.33
8 760849 73 0.32
9 392306676 69 0.30
10 10190138 66 0.29

The data reveals that the top 10 hosts with the most listings are led by host 90729398 with 327 listings (1.42%) of total. The next two hosts, 279001183 and 343442154, have 250 (1.08%) and 154 (0.67%) listings respectively. The remaining seven hosts each have less than 100 listings, contributing less than 0.5% to the total each. For a clearer view of the results, I've created the following chart.

In [81]:
# Filter the top 10 hosts
top_10_hosts = data_v4['host_id'].astype(str).value_counts().head(10).sort_values()

# Create the graph
fig = go.Figure(data=[go.Bar(
    x=top_10_hosts.values, y=top_10_hosts.index,
    orientation='h', text=top_10_hosts.values, textposition='auto',
    hovertemplate='Host ID: %{y}<br>Number of Listings: %{x}<extra></extra>', 
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    autosize=False, title='<b>Top 10 Hosts Listings</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, 
    xref='paper', yref='paper', x=1, y=-0.2, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Show the chart
fig.show()

2. Where are the top 10 Melbourne Airbnbs listings located?¶

In [82]:
# Create a frequency table for the top 10 neighbourhoods
top_10_neighbourhoods = pd.DataFrame({
    'Neighbourhood': data_v4['neighbourhood_cleansed'].value_counts().head(10).index,
    'Number of Listings': data_v4['neighbourhood_cleansed'].value_counts().head(10).values,
    'Percentage of Listings': np.round((data_v4['neighbourhood_cleansed'].value_counts().head(10) / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
top_10_neighbourhoods.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
top_10_neighbourhoods.index = top_10_neighbourhoods.index + 1

# Display the frequency table
top_10_neighbourhoods
Out[82]:
Neighbourhood Number of Listings Percentage of Listings
1 Melbourne 7246 31.40
2 Port Phillip 2723 11.80
3 Yarra 1771 7.67
4 Stonnington 1501 6.50
5 Yarra Ranges 1106 4.79
6 Merri-bek 940 4.07
7 Monash 590 2.56
8 Boroondara 586 2.54
9 Whitehorse 582 2.52
10 Wyndham 575 2.49

The data reveals that the majority of Airbnb listings in Melbourne are concentrated in the Melbourne neighbourhood itself, accounting for 31.40% of the total listings. The next two neighbourhoods, Port Phillip and Yarra, have 11.80% and 7.67% of the listings respectively. The remaining seven neighbourhoods each have less than 7% of the total listings, indicating a more dispersed distribution of Airbnb listings in these areas.

In [83]:
# Filter the top 10 neighbourhoods
top_10_neighbourhoods = data_v4['neighbourhood_cleansed'].value_counts().head(10).sort_values()

# Create the graph
fig = go.Figure(data=[go.Bar(
    x=top_10_neighbourhoods.values, y=top_10_neighbourhoods.index,
    orientation='h', text=top_10_neighbourhoods.values, textposition='auto',
    hovertemplate='Neighbourhood: %{y}<br>Number of Listings: %{x}<extra></extra>', 
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    autosize=False, title='<b>Top 10 Neighbourhoods with Most Listings</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, 
    xref='paper', yref='paper', x=1, y=-0.2, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Show the chart
fig.show()

Here's a graph map displaying the results.

In [84]:
# Load the GeoJSON file
gdf = gpd.read_file('/kaggle/input/melbourne-airbnb-september-2023-dataset/neighbourhoods.geojson')

# Count the number of listings and add to the GeoDataFrame
gdf['listings'] = gdf['neighbourhood'].map(data_v4['neighbourhood_cleansed'].value_counts())

# Filter the 10 neighbourhoods with the most listings
top_10_neighbourhoods = gdf.nlargest(10, 'listings')

# Create a new column with the neighbourhood names
top_10_neighbourhoods['Neighbourhood'] = top_10_neighbourhoods['neighbourhood']

# Create the interactive map plot
fig = px.choropleth_mapbox(top_10_neighbourhoods, geojson=gdf.geometry, 
                           locations=top_10_neighbourhoods.index, color='listings',
                           hover_data={'Neighbourhood':True, 'listings':True},
                           color_continuous_scale=[(0, '#FFFFFF'), (1, '#ff4c5e')], # Custom colorscale
                           mapbox_style="open-street-map",
                           zoom=10, center = {"lat": -37.8136, "lon": 144.9631}, opacity=0.5,
                           labels={'listings':'Number of Listings'})

# Customize the layout
fig.update_layout(
    autosize=False, width=800, height=500, margin={"r":0,"t":0,"l":0,"b":0},
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))

# Update the hover template
fig.update_traces(hovertemplate='<b>Neighbourhood: %{customdata[0]}</b><br><b>Number of Listings: %{z}</b>')

# Show the chart
fig.show()

3. How are the room type distributed in the listings?¶

In [85]:
# Create a frequency table for room types
room_type = pd.DataFrame({
    'Room Type': data_v4['room_type'].value_counts().head().index,
    'Number of Listings': data_v4['room_type'].value_counts().head().values,
    'Percentage of Listings': np.round((data_v4['room_type'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
room_type.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
room_type.index = room_type.index + 1

# Display the frequency table
room_type
Out[85]:
Room Type Number of Listings Percentage of Listings
1 Entire home/apt 16803 72.80
2 Private room 5911 25.61
3 Shared room 277 1.20
4 Hotel room 89 0.39

The room type in the listings are distributed in 'Entire home/apt' with 16.803 listings, accounting for (72.80%) of the total. This is followed by 'Private room' with 5.911 listings (25.61%), 'Shared room' with 277 listings (1.20%), and 'Hotel room' with 89 listings (0.39%).

In [86]:
# Filter room types and calculate their percentage
room_type = (data_v4['room_type'].value_counts() / data_v4.shape[0] * 100).round(2)

# Create the graph
fig = go.Figure(data=[go.Bar(
    x=room_type.index, y=room_type.values, 
    text=room_type.values, textposition='auto',
    hovertemplate='Room type: %{x}<br>Number of Listings: %{text}%<extra></extra>', 
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    autosize=False, title='<b>Room Type Distribution</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, 
    xref='paper', yref='paper', x=1, y=-0.2, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Show the chart
fig.show()

4. What are the 5 most common numbers of bedrooms in the listings?¶

In [87]:
# Create a frequency table for the top 5 bedroom counts
top_5_bedrooms = pd.DataFrame({
    'Number of Bedroom': data_v4['bedrooms'].value_counts().head().index,
    'Number of Listings': data_v4['bedrooms'].value_counts().head().values,
    'Percentage of Listings': np.round((data_v4['bedrooms'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
top_5_bedrooms.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
top_5_bedrooms.index = top_5_bedrooms.index + 1

# Display the frequency table
top_5_bedrooms
Out[87]:
Number of Bedroom Number of Listings Percentage of Listings
1 1 11386 49.33
2 2 6926 30.01
3 3 2527 10.95
4 4 1051 4.55
5 0 748 3.24

The data reveals that the majority of listings have one bedroom, accounting for nearly half of the total with 11,386 listings, or (49.33%). Two-bedroom listings are the next most common, making up (30.01%) with 6,926 of the listings. As we move to larger properties, the numbers decrease, with 3-bedroom and 4-bedroom making up (10.95%) with 2,527 listings and (4.55%) with 1,051 listings respectively. The least common are studios, or 0-bedroom properties, which make up (3.24%) with 748 listings.

In [88]:
# Filter bedrooms booked and calculate their percentage
bedrooms_booked = (data_v4['bedrooms'].value_counts().head() / data_v4.shape[0] * 100).round(2)

# Create the graph
fig = go.Figure(data=[go.Bar(
    x=bedrooms_booked.index.astype(str), y=bedrooms_booked.values, 
    text=bedrooms_booked.values, textposition='auto',
    hovertemplate='Number of Bedroom: %{x}<br>Number of Listings: %{text}%<extra></extra>', 
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    autosize=False, title='<b>The 5 Most Common Bedroom</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)

# Add an annotation for clarity
fig.add_annotation(
    text='<b>Note: Bedroom = 0 (Studio)</b>', align='right', showarrow=False, 
    xref='paper', yref='paper',
    x=1, y=-0.14, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, 
    xref='paper', yref='paper', x=1, y=-0.2, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Show the chart
fig.show()

5. What are the 5 most frequent minimum_nights?¶

In [89]:
# Create a frequency table for the most common minimum nights
most_minimum_nights = pd.DataFrame({
    'Minimum Nights': data_v4['minimum_nights'].value_counts().head().index,
    'Number of Listings': data_v4['minimum_nights'].value_counts().head().values,
    'Percentage of Listings': np.round((data_v4['minimum_nights'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
most_minimum_nights.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
most_minimum_nights.index = most_minimum_nights.index + 1

# Display the frequency table
most_minimum_nights
Out[89]:
Minimum Nights Number of Listings Percentage of Listings
1 1 7895 34.21
2 2 7325 31.74
3 3 3008 13.03
4 7 1257 5.45
5 5 1021 4.42

The five most frequent minimum nights are 1, 2, and 3 which together account for 78.98% of the total listings. This suggest that most hosts accommodates short-term stays and cater to travelers who are looking for flexibility and convinience. On the other hand, listings with minimum nights of 5 or more are relatively rare. This shows that hosts who prefer long-term stays are less common.

In [90]:
# Create a frequency table for the most common minimum nights
most_minimum_nights = pd.DataFrame({
    'Minimum Nights': data_v4['minimum_nights'].value_counts().head().index,
    'Number of Listings': data_v4['minimum_nights'].value_counts().head().values,
    'Percentage of Listings': np.round((data_v4['minimum_nights'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
most_minimum_nights.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1 instead of 0
most_minimum_nights.index = most_minimum_nights.index + 1

# Create the graph
fig = go.Figure(data=[go.Bar(
    x=most_minimum_nights['Minimum Nights'].astype(str), y=most_minimum_nights['Number of Listings'], 
    text=most_minimum_nights['Number of Listings'], textposition='auto',
    hovertemplate='Minimum Nights: %{x}<br>Number of Listings: %{y}<extra></extra>', 
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    autosize=False, title='<b>The 5 Most Minimum Nights</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, 
    xref='paper', yref='paper', x=1, y=-0.2, 
    bgcolor='white', font=dict(size=12, color='black')
)

# Show the chart
fig.show()

6. What are the four amenities most commonly provided?¶

In [91]:
# Transform the string series into a list series and counting the total of each amenity
amenities = [item for sublist in data_v4['amenities'].apply(ast.literal_eval) for item in sublist]
quantity_amenities = pd.Series(amenities).value_counts().head(4)

# Create the frequency table
quantity_amenities = pd.DataFrame({
        'Amenity': quantity_amenities.index,
        'Number of Amenities': quantity_amenities.values,
        'Percentage of Amenities': np.round((quantity_amenities.values / data_v4.shape[0]) * 100, 2)
})

# Reset the index to get a standard numeric index
quantity_amenities.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1
quantity_amenities.index = quantity_amenities.index + 1

# Display the frequency table
quantity_amenities
Out[91]:
Amenity Number of Amenities Percentage of Amenities
1 Kitchen 21865 94.74
2 Smoke alarm 21605 93.61
3 Wifi 20307 87.99
4 Essentials 20017 86.73

The data shows that the most popular amenities are those that are essential for a comfortable and safe stay, such as kitchen, smoke alarm, wifi, and essentials. These amenities are provided by more than 85% of the listings, indicating that they are highly valued by both hosts and guests.

In [92]:
# Sort the DataFrame in ascending order by 'Number of Amenities'
quantity_amenities = quantity_amenities.sort_values('Number of Amenities', ascending=True)

# Create the bar chart
fig = go.Figure(go.Bar(
    x=quantity_amenities['Number of Amenities'], y=quantity_amenities['Amenity'], orientation='h', 
    text=quantity_amenities['Number of Amenities'], textposition='auto',
    hovertemplate='Amenities: %{y}<br>Number of Amenities: %{x}<extra></extra>', marker_color='#f5b7b9'
))

# Customize the layout
fig.update_layout(
    autosize=False, title_text='<b>4 Amenities Most Commonly Provided</b>', title_x=0.5,
    width=800, height=500, hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

Bivariate Analysis¶

Bivariate analysis is a statistical method that helps you investigate relationships bewtween two different variables. This analysis can provide valuable insights into how these variables interact with each other.

7. Which room type in Melbourne's Airbnb listings has the highest number of reviews?¶

In [93]:
# Create a frequency table for room types and their reviews
room_type_reviews = pd.DataFrame({
    'Room Type': data_v4['room_type'].value_counts().index,
    'Number of Reviews': data_v4.groupby('room_type')['number_of_reviews'].sum(),
    'Percentage of Reviews': np.round(data_v4.groupby('room_type')['number_of_reviews'].sum() / data_v4['number_of_reviews'].sum() * 100, 2)
})

# Sort the values in the frequency table by 'Number of Reviews'
room_type_reviews = room_type_reviews.nlargest(4, 'Number of Reviews').reset_index(drop=True)

# Display the frequency table
room_type_reviews
Out[93]:
Room Type Number of Reviews Percentage of Reviews
0 Entire home/apt 613773 86.20
1 Shared room 94852 13.32
2 Private room 1765 0.25
3 Hotel room 1674 0.24

The data shows that the room type Entire home/apt has the highest number of reviews, accounting for 86.20% of the total reviews. This suggests that this room type is the most popular and preferred among guests, as they are likely to leave more feedback when they are satisfied with their stay.

In [94]:
# Create the graph
fig = go.Figure(data=[go.Bar(x=room_type_reviews['Room Type'], y=room_type_reviews['Number of Reviews'],
                             text=room_type_reviews['Number of Reviews'], textposition='auto',
                             hovertemplate='Room Type: %{x}<br>Number of Reviews: %{y}<extra></extra>', marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(title='<b>Room Type with the Highest Reviews</b>', title_x=0.5,
                   width=800, height=500,
                   hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))

# Add the source
fig.add_annotation(text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
                    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

8. The average number of amenities varies significantly across different room types?¶

In [95]:
# Group by room type and calculating the average number of amenities
room_type_amenities = data_v4['room_type'].value_counts().index.to_series().map(lambda x: data_v4[data_v4['room_type'] == x]['amenities'].apply(lambda y: len(ast.literal_eval(y))).mean())

# Create a DataFrame
room_type_amenities = pd.DataFrame({'Room Type': room_type_amenities.index, 
                                    'Average Amenities': room_type_amenities.values}).round(2).sort_values('Average Amenities', ascending=False)

# Reset the index to start from 1
room_type_amenities.reset_index(drop=True, inplace=True)

# Adjust the index to start at 1
room_type_amenities.index = room_type_amenities.index + 1

# Display the DataFrame
room_type_amenities
Out[95]:
Room Type Average Amenities
1 Entire home/apt 35.14
2 Private room 24.00
3 Hotel room 23.89
4 Shared room 18.20

These averages indicate a noticeable variation in the number of amenities across different room types. 'Entire home/apt' listings provide the most amenities on average 35.15, while 'Shared room' provide the fewest 18.20. This trend aligns with our hypothesis that the average number of amenities provided varies by room type.

In [96]:
# Create the graph
fig = go.Figure(data=[go.Bar(
    x=room_type_amenities['Room Type'], y=room_type_amenities['Average Amenities'],
    text=room_type_amenities['Average Amenities'], textposition='auto', hovertemplate='Room Type: %{x}<br>Average Amenities: %{y}<extra></extra>',
    marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    title='<b>Average Amenities by Room Type</b>', title_x=0.5, width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper', x=1, y=-0.2,
    bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

9. Is the average price per night higher for private room than for shared rooms?¶

In [97]:
# Create a table with the average for each room type
average_room_type = pd.DataFrame(
    round(data_v4[data_v4['room_type'].isin(data_v4['room_type'].value_counts().head(5).index)]
          .groupby('room_type')['price'].mean(),2)).reset_index().rename(columns={'room_type': 'Room Type', 'price': 'Average Price'}).sort_values('Average Price', ascending=True)

# Reset the index to start from 1
average_room_type = average_room_type.reset_index(drop=True)

# Adjust the index to start at 1
average_room_type.index = average_room_type.index + 1

# Show the table
average_room_type
Out[97]:
Room Type Average Price
1 Private room 100.22
2 Shared room 117.17
3 Entire home/apt 232.23
4 Hotel room 243.19

The hypothesis predicts that the average price per night for private rooms would be higher than for shared rooms. However, the data shows the average price for a private room $100.22 is actually lower than for a shared room $117.17. Therefore, this specific set of data does not support this hypothesis.

In [98]:
# Create the graph
fig = go.Figure(data=[go.Bar(
    x=average_room_type['Room Type'], y=average_room_type['Average Price'],
    text=average_room_type['Average Price'], textposition='auto',
    hovertemplate='Room Type: %{x}<br>Average Price: %{y}<extra></extra>', marker_color='#f5b7b9'
)])

# Customize the layout
fig.update_layout(
    title='<b>Average Prices for each Room Type</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

Correlation Analysis¶

Correlation analysis is a specific type of bivariate analysis that measures the strength and direction of the linear relationship between two variables. It results in a correlation coefficient, a value between -1 and 1. A correlation coefficient close to 1 indicates a strong positive relationship, a coefficient close to -1 indicates a strong negative relationship, and a coefficient close to 0 indicates no linear relationship.

To perform this analysis, first I will select quantitative variables using the .select_dtypes() method. This method will include int64 and float64 data types. Then, I will use the .iloc method to select only the variables of interest.

In [99]:
# Select quantitative variables (without latitude and longitude)
quantitative_variables = data_v4.select_dtypes(include=['int64', 'float64'])
quantitative_variables = quantitative_variables.iloc[:, 3:]

# Preview the 3 rows
quantitative_variables.head(3)
Out[99]:
accommodates bedrooms beds price minimum_nights number_of_reviews
0 2 1 1 49.0 5 173
1 2 1 1 95.0 3 42
2 5 3 3 116.0 1 228

Now, I'm going to calculate and display the Pearson coefficient matrix of the quantitative variables. The correlation matrix is a table that shows the correlation coefficients between each pair of variables.

In [100]:
# Calculate the Pearson coeficient
correlation_matrix = quantitative_variables.corr(method='pearson')

# Exhibit correlation_matriz
correlation_matrix
Out[100]:
accommodates bedrooms beds price minimum_nights number_of_reviews
accommodates 1.000000 0.811764 0.861758 0.482950 -0.018598 0.022064
bedrooms 0.811764 1.000000 0.785499 0.470027 -0.003285 -0.047187
beds 0.861758 0.785499 1.000000 0.438041 -0.005333 -0.000953
price 0.482950 0.470027 0.438041 1.000000 0.022074 -0.047232
minimum_nights -0.018598 -0.003285 -0.005333 0.022074 1.000000 -0.027315
number_of_reviews 0.022064 -0.047187 -0.000953 -0.047232 -0.027315 1.000000

Here I'm going to create a heatmap of the correlation matrix. A heatmap is a graphical representation of data where the values are shown by different colors. This helps visualize the strength and direction of the correlations between the variables.

In [101]:
# Get correlation matrix
z = correlation_matrix.values.round(2)

# Get variable names
x = correlation_matrix.columns
y = correlation_matrix.index

# Create the heatmap
heatmap = go.Figure(data=go.Heatmap(
                   z=z, x=x, y=y, colorscale=[(0, '#FFFFFE'), (1, '#ff4c5e')],
                   hoverongaps = False, hoverinfo='z'
))

# Customize the layout
heatmap.update_layout(
    title='<b>Correlation Matrix</b>', title_x=0.5,
    width=800, height=500,
    hoverlabel=dict(font_size=16, font_family="Rockwell"
))

# Add the source
heatmap.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Add annotations (values) to each cell
for i in range(len(y)):
    for j in range(len(x)):
        # Round the value to 2 decimal places
        value = round(z[i][j], 2)
        heatmap.add_annotation(dict(
            x=x[j], y=y[i], text=str(value),
            showarrow=False, font=dict(size=12)
))

# Show the plot
heatmap.show()

10. Properties with more reviews might be priced higher due to their popularity?¶

In [102]:
# Calculate the correlation
correlation_price_review = data_v4[['price', 'number_of_reviews']].corr().round(2)

# Display the correlation matrix
correlation_price_review
Out[102]:
price number_of_reviews
price 1.00 -0.05
number_of_reviews -0.05 1.00

The correlation analysis revealed a weak negative correlation of -0.05 between the number of reviews and the price. This suggests that properties with more reviews do not necessarily have higher prices. In fact, the slight negative correlation indicates that properties with more reviews might be priced slightly lower, although the correlation is so weak that it's not a strong relationship.

In [103]:
# Create a scatter plot of 'number_of_reviews' vs 'price' using Plotly graph objects
scatter = go.Scatter(
    x=data_v4['number_of_reviews'], y=data_v4['price'], mode='markers', 
    hovertemplate='Number of Reviews: %{x}<br>Price: %{y}<extra></extra>', marker_color='#f5b7b9'
)

# Customize the layout
layout = go.Layout(
    title='<b>Price vs Number of Reviews</b>', title_x=0.5, 
    xaxis_title='<b>Number of Reviews</b>', yaxis_title='<b>Price</b>',
    width=800, height=500,
    hoverlabel=dict(bgcolor='white', font_size=16, font_family="Rockwell"
))

# Create the figure and adding the scatter plot
fig = go.Figure(data=scatter, layout=layout)

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

11. Listings with more amenities might be priced higher?¶

In [104]:
# Calculate the number of amenities for each listing
num_amenities = data_v4['amenities'].apply(lambda x: len(x))

# Check the correlation using the .corr() method, using as argument the Pearson method
pd.concat([num_amenities, data_v4['price']], axis=1).corr(method='pearson').round(2)
Out[104]:
amenities price
amenities 1.00 0.15
price 0.15 1.00

The hypothesis states that listings with more amenities might be priced higher. The answer shows that the correlation coefficient between the number of amenities and the price is 0.15, which indicates a weak positive relationship. That means that there is a slight tendenct for listings with more amenities to have higher prices, but it is not a strong or consistent pattern.

In [105]:
# Create a scatterplot to visualize the relationship using Plotly's graph_objects
fig = go.Figure(data=go.Scatter(
    x=num_amenities, y=data_v4['price'], mode='markers', 
    hovertemplate='Number of Amenities: %{x}<br>Price: %{y}<extra></extra>', marker_color='#f5b7b9'
))

# Customize the layout
fig.update_layout(title='<b>Price vs Number of Amenities</b>', title_x=0.5, 
                  xaxis_title='<b>Number of Amenities</b>', yaxis_title='<b>Price</b>',
                  width=800, height=500,
                  hoverlabel=dict(bgcolor='white', font_size=16, font_family="Rockwell"
))

# Add the source
fig.add_annotation(
    text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
    x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))

# Show the chart
fig.show()

Summary of the analysis:

  • The distribution of listings among hosts is highly varied, with the top host having nearly five times as many listings as the 10th host. Despite this, the top 10 hosts account for a relatively small percentage of the total listings, suggesting a diverse and broad range of hosts.

  • The data indicates a pronounced concentration of Airbnb listings in the Melbourne neighbourhood, with a significant drop in the number of listings in other neighbourhoods. This could suggest a higher demand for Airbnb accommodations in the Melbourne neighbourhood compared to others. Further analysis could provide insights into the factors influencing this distribution, such as tourist attractions and accessbility in these neighbourhoods.

  • In summary, the data suggests that in Melbourne's Airbnb accommodations, guests predominantly prefer the comfort and privacy of renting entire homes/apartments or private rooms, which constitute over 98% of all listings. The less common categories, such as shared rooms and hotel rooms, account for just 1.59% of listings, indicating they're less favored.

  • The data shows that one and two-bedroom accommodations accounts for 49.33% and 30.01% of the listings respectively. These findings suggest that most Airbnb guests in Melbourne prefer smaller and more affordable accommodations, possibly for solo travelers or couples. Listings with three or four bedrooms are much less prevalent, around 15% of the listings, indicating a lower demand. Lastly, we have 3.24% of the listings with no bedroom, or known as studio, being a more compact and economic option for guests.

  • In summary, the minimum nights distribuition revelas that most Melbourne Airbnb listings are suitable for short-term stays, while long-term stays are less available and less popular.

  • The data reveals that the most common amenities in Melbourne’s Airbnb listings are kitchen, smoke alarm, wifi, and essentials, which are considered as basic and necessary for a pleasant and secure stay.

  • In summary, the room type Entire home/apt in Melbourne’s Airbnb listings has the highest number of reviews, implying that it is the most in-demand and most satisfying type of accommodation. The other room types, such as Shared room, Private room, and Hotel room, have much lower numbers and percentages of reviews, suggesting that they are less available and less attractive to guests.

  • The data reveals a distinct variation in the average number of amenities across different room types. Specifically, listings categorized as ‘Entire home/apt’ offer the highest average number of amenities, approximately 35.14, whereas ‘Shared room’ listings offer the least, around 18.20. This trend supports our initial hypothesis that room type influences the average number of amenities provided. However, to establish this observation as a statistically significant fact, further analysis, such as an ANOVA test would be required.

  • In our analysis of room types and their average prices, we tested hypothesis number 9. Our data showed that the average price for a private room $100.22 was actually lower than for a shared room $117.17, indicating that the data did not suporrt hypothesis number 9.

  • The hypothesis that properties with more reviews might be priced higher due to their popularity was not supported by the data. The correlation analysis showed a weak negative correlation of -0.05 between the number of reviews and the price.

  • Our analysis examined the correlation between the number of amenities in a listing and its price. We hypothesized that more amenities would correlate with higher prices. The data, however, showed a weak positive correlation of 0.15, indicating only a slight increase in price with more amenities, and not a consistent trend.

Please note that this preliminary analysis is based solely on the provided data and does not consider other potentially influential factors. For a more comprehensive understanding of the situation, it would be advisable to incorporate additional aspects such as geographical location, time of year, and local regulations into the analysis.

Dashboard¶

The Dashboard is available here.

Conclusion¶

Based on our problem statement and questions, we have the following conclusion:

Melbourne's Airbnb market is characterized by its diversity, with a wide range of hosts contributing to a rich variety of listings. The majority of these listings are found in the Melbourne neighbourhood, indicating a high demand in this area.

Guests show a clear preference for renting entire homes or apartments, as well as private rooms, with smaller one or two-bedroom accommodations being particularly popular. These listings are predominantly suitable for short-term stays, catering to the needs of transient visitors.

In terms of amenities, the basics such as a kitchen, smoke alarms, Wi-Fi, and essentials are commonly provided across all listings. Among the various types of accommodations, 'Entire home/apt' listings stand out as the most popular and offer the most amenities.

Interestingly, despite offering more personal space, private rooms are generally cheaper than shared rooms. Furthermore, having more reviews or amenities does not necessarily equate to higher prices, suggesting that other factors may be at play in the pricing decisions.

These findings provide a comprehensive overview of the key characteristics of Airbnb listings in Melbourne. However, they are based on the current data and may change over time.

Thanks for your time!¶